Background

Libraries

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2     ✓ purrr   0.3.4
✓ tibble  3.0.3     ✓ dplyr   1.0.2
✓ tidyr   1.1.2     ✓ stringr 1.4.0
✓ readr   1.4.0     ✓ forcats 0.5.0
── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(coxed)
Loading required package: rms
Loading required package: Hmisc
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Registered S3 method overwritten by 'htmlwidgets':
  method           from         
  print.htmlwidget tools:rstudio
Registered S3 method overwritten by 'data.table':
  method           from
  print.data.table     

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:dplyr’:

    src, summarize

The following objects are masked from ‘package:base’:

    format.pval, units

Loading required package: SparseM

Attaching package: ‘SparseM’

The following object is masked from ‘package:base’:

    backsolve

Loading required package: mgcv
Loading required package: nlme

Attaching package: ‘nlme’

The following object is masked from ‘package:dplyr’:

    collapse

This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
library(survival)
library(randomForestSRC)

 randomForestSRC 2.9.3 
 
 Type rfsrc.news() to see new features, changes, and bug fixes. 
 


Attaching package: ‘randomForestSRC’

The following object is masked from ‘package:Hmisc’:

    impute

The following object is masked from ‘package:purrr’:

    partial
library(ggRandomForests)

Attaching package: ‘ggRandomForests’

The following object is masked from ‘package:randomForestSRC’:

    partial.rfsrc
# Akima is simply a supplementary library to support the graphing of tuning data from randomForestSRC
library(akima)

Introduction

RSF and justification

The random survival forest (RSF) method was recently developed as a powerful machine learning tool for predicting survival, primarily in clinical populations. RSF extends traditional random forests (Breiman) to model cumulative hazard functions for individual patients using an ensemble of decision trees (Ishwaran), with each prediction being informed by the cumulative hazard function of a training population. These models represent a powerful alternative to traditional Cox proportional hazards models and time-to-event analyses.

RSFs have been implemented in several different contexts in both R and Python. randomForestSRC (https://cran.r-project.org/web/packages/randomForestSRC/citation.html) and ranger (https://cran.r-project.org/web/packages/ranger/citation.html) are common R implementations of RSF, and a Python implementation was recently added to the scikit-survival package (https://scikit-survival.readthedocs.io/en/latest/cite.html). These implementations have been benchmarked on a variety of data sets and are independently supported by third-party research findings.

However, despite intense interest and use, few comprehensive benchmarking studies have been performed on these models. Most third-party benchmarking studies of RSF focus on a few key aspects of the implementation or data types (for example, https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0383-8). Furthermore, most benchmarking studies presume a particular type or quality of real-world clinical data, which represents a relatively narrow use case and makes it difficult to judge how well RSF methods apply to common clinical data scenarios such as unbalanced data or small sample sizes. These challenges are exacerbated by the wealth of tuning options for RSF implementations (randomForestSRC calls have 34 listed arguments and ranger calls have 38, although some of those arguments are unlikely to affect predictive performance). The combination of current knowledge gaps and a large number of prospective tuning options makes it nearly impossible to adequately optimize random survival forests for a given real-world use case without detailed background knowledge of model behavior. The goal of this study is to perform RSF benchmarking using simulated gene and survival data.

Survival data simulation details

Harden (2018) is critical to this endeavor. He describes a relatively elegant solution to simulating survival data that builds the survival and hazard functions in a stepwise manner. First, the method randomly selects timepoints along the desired duration, and then it randomly selects the same number of values in the [0,1] interval, with 0 and 1 serving as anchor points at time 0 and the max time. These Y values are sorted and then assigned to the chosen timepoints. A cubic spline is constructed through the points using Hyman's (1983) method to ensure monotonicity. This represents the cumulative density function (CDF) of the hazard model. The probability density function (PDF) is constructed from the differences between successive points. The survival function can then be constructed relatively easily by inverting the CDF. Finally, the baseline hazard function is computed by dividing the failure PDF by the survival function (see Fig 1 in Harden).
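The stepwise construction above can be sketched in a few lines of base R; splinefun(method = "hyman") provides the monotone cubic interpolation. This is an illustrative sketch of the approach, not coxed's internal code, and all names are made up:

```r
set.seed(1)
T_max  <- 100                                # study duration
knots  <- 8                                  # number of interior points to draw
t_knot <- sort(sample(1:(T_max - 1), knots)) # random timepoints along the duration
y_knot <- sort(runif(knots))                 # sorted random values in (0, 1)

# Anchor at (0, 0) and (T_max, 1), then fit a monotone cubic spline:
# this is the CDF of failure times
cdf_fun <- splinefun(c(0, t_knot, T_max), c(0, y_knot, 1), method = "hyman")

times    <- 0:T_max
fail_cdf <- cdf_fun(times)
fail_pdf <- c(0, diff(fail_cdf))   # PDF from differences between successive points
survivor <- 1 - fail_cdf           # survival function inverts the CDF
# baseline hazard = failure PDF / survivor
# (dropping the terminal point, where the survivor function reaches 0)
hazard   <- head(fail_pdf, -1) / head(survivor, -1)
```

The Hyman filter guarantees the interpolated CDF never decreases, which keeps the implied PDF and hazard non-negative.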

Covariates are randomly generated (or given by the analyst), and then “true” coefficients are randomly generated as well (these can also be supplied by the analyst, which will allow me to set uninformative variables). The combination of covariate and coefficient values forms a linear predictor, which is used to calculate exp(XBeta): the proportional change from the baseline survival function for a given observation (e.g., 50% lower at all points). A value is then drawn from U[0,1], and the duration for that observation is the timepoint at which its calculated survival function drops below that value. Censoring is calculated independently, based on the presumption that censorship mechanisms are not linked to the data-generating process (Jackson 2014), which is a reasonable assumption.
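A minimal self-contained sketch of that duration-assignment step, using a toy exponential baseline and the standard proportional-hazards form S_i(t) = S0(t)^exp(XBeta). Names here are illustrative, not coxed internals:

```r
set.seed(2)
times    <- 0:100
survivor <- exp(-0.03 * times)   # toy baseline survivor function

n    <- 5                                    # observations
X    <- matrix(rbinom(n * 3, 1, 0.5), n, 3)  # three binary covariates
beta <- c(0.5, -0.3, 0.1)                    # "true" coefficients
xb   <- drop(X %*% beta)                     # linear predictor

# Proportional hazards: each observation's survivor curve is S0(t)^exp(xb_i)
surv_i <- sapply(exp(xb), function(e) survivor^e)

# Draw u ~ U[0,1]; the duration is the first time S_i(t) drops below u
# (an NA would mean the curve never drops below u within the study window)
u   <- runif(n)
dur <- sapply(seq_len(n), function(i) times[which(surv_i[, i] < u[i])[1]])
dur
```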

Plan

Potential data features

Variable types

  • Binary (e.g., mutation)
    • Roughly 1000 variables, maybe 500 (corresponds with Illumina TruSight 500 panel)
  • Ordinal (e.g., severity or stage)
    • 10 (ordinal data are relatively uncommon)
  • Unordered factor (e.g., race)
    • 10 (UF data are also relatively uncommon)
  • Continuous percentage (e.g., VAF)
    • Roughly 1000 variables, maybe 500
  • Continuous larger numbers with log-norm distribution (e.g., TPM; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1240081/)
    • Ask Brian C
  • Continuous larger numbers with norm distribution (e.g., WBC)
    • 100 variables

Percentage of informative variables

  • All (100%)
  • High (75%)
  • Moderate (50%)
  • Low (25%)
  • Very few (5%)

Data balance

  • Balanced
  • Minor imbalance pos (60/40)
  • Minor imbalance neg (40/60)
  • Major imbalance pos (75/25)
  • Major imbalance neg (25/75)
  • Rare events pos (95/5)
  • Rare events neg (5/95)

Data avail

Sample sizes spaced evenly on a log scale, from small to large

exp(seq(log(50), log(1000), length.out = 10))
 [1]   50.00000   69.74754   97.29439  135.72088  189.32395  264.09760  368.40315
 [8]  513.90427  716.87116 1000.00000

Seems reasonable to use: 50, 70, 97, 136, 189, 264, 368, 514, 717, 1000

Simulation needs

  • 2000 samples (cite Hendry (2014)?; https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.5945)
    • 1000 samples 0 event
    • 1000 samples 1 event
  • Can use https://cran.r-project.org/web/packages/coxed/vignettes/simulating_survival_data.html to simulate data

For each dataset, draw the two sample types (0 event / 1 event) in the proportions given by the balance settings above, with totals taken from the log-scale size list. This represents 7 balance possibilities across 10 data-size categories, or a matrix of 70 datasets. Then we should select a number of features that is reasonable for a given type of data.

Start the simulations with homogeneous data types representing “typical” variable numbers:

  • Binary
  • Continuous percentage
  • Continuous log-norm

Each has the 5 levels of informative data, with two ways of trimming variables:

  • Randomly filtered
  • Uninformative removal

With this list of combinations, I’m simulating 1050 different data sets. More precisely, I’m combining a smaller amount of simulated data into 1050 different combinations.
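One way to arrive at that count, assuming the design crosses the 3 homogeneous variable types, 5 informative levels, 7 balance settings, and 10 sample sizes enumerated above (factor labels are illustrative):

```r
design <- expand.grid(
  var_type   = c("binary", "pct", "lognorm"),   # homogeneous variable types
  inform_pct = c(100, 75, 50, 25, 5),           # percentage of informative variables
  balance    = c("50/50", "60/40", "40/60", "75/25",
                 "25/75", "95/5", "5/95"),      # case balance
  n_samp     = round(exp(seq(log(50), log(1000), length.out = 10)))  # sample sizes
)
nrow(design)   # 3 * 5 * 7 * 10 = 1050
```

A grid like this also gives each simulated dataset a natural row of bookkeeping metadata.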

Methods

Generic plotting function for randomForestSRC

# Contour plot of OOB error over the nodesize x mtry grid searched by
# randomForestSRC::tune(); interpolation is provided by akima::interp()
plot.tune <- function(o, linear = TRUE) {
  x <- o$results[, 1]
  y <- o$results[, 2]
  z <- o$results[, 3]
  so <- interp(x = x, y = y, z = z, linear = linear)
  idx <- which.min(z)
  x0 <- x[idx]
  y0 <- y[idx]
  filled.contour(x = so$x,
                 y = so$y,
                 z = so$z,
                 xlim = range(so$x, finite = TRUE) + c(-2, 2),
                 ylim = range(so$y, finite = TRUE) + c(-2, 2),
                 color.palette = colorRampPalette(c("blue", "orange")),
                 xlab = "nodesize",
                 ylab = "mtry",
                 main = "OOB error for nodesize and mtry in tune.rfsrc",
                 key.title = title(main = "OOB error", cex.main = 1),
                 plot.axes = {axis(1); axis(2)
                              points(x0, y0, pch = "x", cex = 1, font = 2)
                              points(x, y, pch = 16, cex = .25)})
}

Initial testing and scoping

To scope the project and learn how to use coxed, let’s focus on just binary variables. In this case, the simulation will be:

Simulation Parameter    Value
Sample number           2000
Case balance            Balanced (50/50)
Duration                365 days
Variable type           Normal
Variable number         500
Percentage inform       Norm Dist Coeff

coxed Tutorial

tutorial <- sim.survdata(N=2000, T=365, num.data.frames = 1, censor=0.5, xvars = 500)
head(tutorial$data, 10)
summary(tutorial$data$failed)

ggplot(tutorial$data) + geom_histogram(aes(x=X1)) + ggtitle("Distribution of first sim var")
ggplot() + geom_histogram(aes(x=tutorial$betas)) + ggtitle("Distribution of 'true' coefficients of simulated data")
tutorial_cox <- coxph(Surv(y, failed) ~ ., data=tutorial$data)

These results seem to meet most of our needs. Now we should try it with the types of data we want. We’ll generate our own X (the covariate matrix) and allow the model to randomly assign coefficients. We can do this 5 times as a representative example and plot the resulting hazard functions.

Genetic data generation

Binary data matrix

2000 observations, each with 500 binary variables, representing genetic data taken from a large cancer screen (Illumina TruSight is roughly 500 genes)

Simulation Parameter    Value
Sample number           2000
Case balance            Balanced (50/50)
Duration                1825 days (5 years)
Variable type           Binary
Variable number         500
Percentage inform       Norm Dist (Random)

Variable generator function

# This function generates a binary data matrix, as might be seen in mutation
# hotspot data or sequencing panels. var_freq is the integer value of percent
# positivity in the sample. It can be a single value, in which case the matrix
# has the same prevalence for every variable, or it can have 2 values,
# representing the range of possible frequencies. For frequency ranges, the
# columns are named by their generated frequency.
binary_generator <- function(samp_num, var_num, var_freq = 50) {
  if (length(var_freq) == 1) {
    message('binary_generator: Fixed frequency mode, fast generation. Pos freq = ', var_freq)
    # One sample() call fills the whole matrix at a single positive frequency
    data_mat <- as.data.frame(matrix(
      sample(0:1, samp_num * var_num, replace = TRUE,
             prob = c(1 - var_freq/100, var_freq/100)),
      samp_num, var_num))
  } else if (length(var_freq) == 2) {
    message('binary_generator: Variable frequency mode. Pos freq = ', var_freq[1], ":", var_freq[2])
    # Draw a target frequency per variable, then sample each column at its frequency
    freqs <- sample(var_freq[1]:var_freq[2], var_num, replace = TRUE) / 100
    data_mat <- as.data.frame(vapply(freqs, function(f)
      sample(0:1, samp_num, replace = TRUE, prob = c(1 - f, f)),
      integer(samp_num)))
    colnames(data_mat) <- paste0("F", seq_along(freqs), "_", freqs)
  } else {
    stop("var_freq must be a single frequency or a c(min, max) range")
  }
  data_mat
}
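The core draw in both branches is sample() with a prob weight. As a quick, self-contained sanity check of that mechanism (the target frequency p here is illustrative), the empirical prevalence should track the requested positivity:

```r
set.seed(99)
p   <- 0.18                                    # illustrative target positivity
col <- sample(0:1, 5000, replace = TRUE, prob = c(1 - p, p))
mean(col)   # empirical prevalence; should land near 0.18
```

The same check can be run across a full generated matrix with colMeans() against the frequencies encoded in the column names.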

Generate initially scoped data matrix

set.seed(12345)
gene_panel_input <- binary_generator(samp_num = 2000, var_num = 500)
head(gene_panel_input)

Generate 5 simulated survival datasets

sim_genepanel <- sim.survdata(T=1825, num.data.frames = 5, censor=0.5, X=gene_panel_input, knots=37)

In this case, the choice of knots matters: it clearly affects the distribution of simulated times. If the number of knots is too low, the simulated durations become very punctate, with many durations clustered around a few timepoints. For large time scales (e.g., simulated 5-year survival), T/50 knots seems to be a reasonable approximation and yields a good variety of data outputs.
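That heuristic is small enough to encode directly; the choice of ceiling() for the rounding direction is my assumption, picked so that T = 1825 reproduces the 37 knots used above:

```r
# Heuristic: for long time scales, use roughly T/50 knots in sim.survdata()
knots_for <- function(T_days) ceiling(T_days / 50)
knots_for(1825)   # 37 for the 5-year simulations
knots_for(365)    # 8 for a 1-year simulation
```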

Diagnostics for simulated data

survsim.plot(sim_genepanel, df=1)
survsim.plot(sim_genepanel, df=2)
survsim.plot(sim_genepanel, df=3)
survsim.plot(sim_genepanel, df=4)
survsim.plot(sim_genepanel, df=5)

(Each call draws a two-panel diagnostic figure; figures not shown here.)

coxph(Surv(y, failed) ~ ., data=sim_genepanel[[1]]$data)
Call:
coxph(formula = Surv(y, failed) ~ ., data = sim_genepanel[[1]]$data)

           coef  exp(coef)   se(coef)      z        p
V1    0.1486342  1.1602485  0.0863574  1.721 0.085223
V2   -0.0354240  0.9651961  0.0844782 -0.419 0.674977
V3   -0.0884946  0.9153081  0.0840147 -1.053 0.292193
V4   -0.0541214  0.9473171  0.0862251 -0.628 0.530217
V5    0.1879883  1.2068193  0.0833714  2.255 0.024144
V6   -0.0377495  0.9629541  0.0848507 -0.445 0.656397
V7   -0.0390525  0.9617003  0.0839031 -0.465 0.641611
V8   -0.0084183  0.9916171  0.0859773 -0.098 0.922002
V9   -0.3900879  0.6769973  0.0834899 -4.672 2.98e-06
V10  -0.0481587  0.9529826  0.0842667 -0.572 0.567659
V11   0.1293773  1.1381195  0.0872535  1.483 0.138134
V12  -0.1406273  0.8688130  0.0842023 -1.670 0.094897
V13  -0.0339935  0.9665778  0.0848140 -0.401 0.688567
V14  -0.0767464  0.9261247  0.0826944 -0.928 0.353370
V15  -0.1465369  0.8636938  0.0852094 -1.720 0.085482
V16  -0.0312147  0.9692674  0.0815506 -0.383 0.701894
V17   0.0962477  1.1010317  0.0861132  1.118 0.263700
V18   0.1680433  1.1829879  0.0831917  2.020 0.043388
V19   0.3480665  1.4163264  0.0865583  4.021 5.79e-05
V20  -0.1698071  0.8438276  0.0860360 -1.974 0.048419
V21  -0.0021137  0.9978886  0.0835002 -0.025 0.979805
V22   0.1566757  1.1696162  0.0827739  1.893 0.058382
V23   0.1065197  1.1123999  0.0813335  1.310 0.190309
V24   0.0255362  1.0258650  0.0824635  0.310 0.756815
V25   0.1136192  1.1203254  0.0839664  1.353 0.176008
V26  -0.0088560  0.9911831  0.0866141 -0.102 0.918561
V27  -0.1058966  0.8995176  0.0838760 -1.263 0.206755
V28  -0.1108017  0.8951162  0.0844921 -1.311 0.189727
V29  -0.2080581  0.8121598  0.0859711 -2.420 0.015516
V30   0.0235089  1.0237874  0.0855628  0.275 0.783504
V31  -0.3263422  0.7215582  0.0852123 -3.830 0.000128
V32  -0.3101892  0.7333082  0.0843952 -3.675 0.000237
V33   0.0162939  1.0164273  0.0823181  0.198 0.843094
V34  -0.2947337  0.7447299  0.0859312 -3.430 0.000604
V35  -0.1478983  0.8625189  0.0852696 -1.734 0.082833
V36  -0.0189479  0.9812304  0.0838954 -0.226 0.821316
V37  -0.0756390  0.9271508  0.0858557 -0.881 0.378317
V38  -0.1044100  0.9008558  0.0838271 -1.246 0.212933
V39   0.1189207  1.1262806  0.0820847  1.449 0.147406
V40   0.0756070  1.0785387  0.0854963  0.884 0.376518
V41   0.0643927  1.0665111  0.0821866  0.783 0.433338
V42   0.0112816  1.0113455  0.0837962  0.135 0.892903
V43  -0.0184189  0.9817497  0.0847696 -0.217 0.827989
V44   0.0383170  1.0390606  0.0853511  0.449 0.653479
V45   0.2876861  1.3333387  0.0821088  3.504 0.000459
V46   0.2126041  1.2368949  0.0855287  2.486 0.012927
V47  -0.1206006  0.8863879  0.0859408 -1.403 0.160528
V48  -0.3074794  0.7352980  0.0835729 -3.679 0.000234
V49   0.1264148  1.1347528  0.0836354  1.511 0.130661
V50   0.1765794  1.1931291  0.0858356  2.057 0.039669
V51  -0.1134322  0.8927647  0.0839621 -1.351 0.176698
V52  -0.2393011  0.7871778  0.0857847 -2.790 0.005278
V53  -0.0305148  0.9699461  0.0831750 -0.367 0.713712
V54  -0.3026666  0.7388454  0.0846137 -3.577 0.000348
V55  -0.0571166  0.9444839  0.0867471 -0.658 0.510264
V56   0.2524180  1.2871340  0.0835161  3.022 0.002508
V57  -0.1130422  0.8931129  0.0865373 -1.306 0.191456
V58   0.2513062  1.2857037  0.0839041  2.995 0.002743
V59  -0.0310206  0.9694556  0.0834361 -0.372 0.710050
V60  -0.1031899  0.9019557  0.0832440 -1.240 0.215121
V61  -0.2846477  0.7522793  0.0837128 -3.400 0.000673
V62   0.2251568  1.2525191  0.0843302  2.670 0.007586
V63   0.0686127  1.0710213  0.0837494  0.819 0.412637
V64   0.3590914  1.4320277  0.0841352  4.268 1.97e-05
V65  -0.0932409  0.9109740  0.0827213 -1.127 0.259671
V66   0.3175639  1.3737770  0.0846114  3.753 0.000175
V67   0.2699825  1.3099415  0.0815636  3.310 0.000933
V68  -0.1411070  0.8683964  0.0857070 -1.646 0.099684
V69   0.0590731  1.0608528  0.0838185  0.705 0.480951
V70  -0.0678715  0.9343805  0.0851933 -0.797 0.425639
V71  -0.2905642  0.7478415  0.0843797 -3.444 0.000574
V72  -0.2917924  0.7469236  0.0851875 -3.425 0.000614
V73  -0.0778418  0.9251108  0.0840672 -0.926 0.354473
V74   0.1174806  1.1246598  0.0867389  1.354 0.175603
V75   0.1942031  1.2143429  0.0853483  2.275 0.022881
V76  -0.2533038  0.7762321  0.0853514 -2.968 0.003000
V77  -0.0302183  0.9702337  0.0855287 -0.353 0.723854
V78   0.1288001  1.1374628  0.0841035  1.531 0.125659
V79  -0.2214136  0.8013851  0.0845781 -2.618 0.008848
V80  -0.1044231  0.9008441  0.0825042 -1.266 0.205631
V81  -0.0123214  0.9877542  0.0833124 -0.148 0.882426
V82   0.1086808  1.1148064  0.0845011  1.286 0.198392
V83   0.3111449  1.3649870  0.0835449  3.724 0.000196
V84  -0.1140675  0.8921977  0.0868622 -1.313 0.189115
V85   0.1813824  1.1988735  0.0873044  2.078 0.037748
V86  -0.1460385  0.8641244  0.0854848 -1.708 0.087570
V87   0.0519035  1.0532741  0.0852792  0.609 0.542769
V88  -0.1838521  0.8320589  0.0832964 -2.207 0.027300
V89  -0.3135483  0.7308491  0.0854648 -3.669 0.000244
V90  -0.0362673  0.9643825  0.0859066 -0.422 0.672901
V91   0.1141745  1.1209477  0.0844270  1.352 0.176265
V92   0.1309449  1.1399049  0.0815537  1.606 0.108356
V93   0.1099841  1.1162603  0.0857327  1.283 0.199537
V94   0.1284568  1.1370723  0.0842744  1.524 0.127441
V95  -0.1307340  0.8774511  0.0866614 -1.509 0.131411
V96  -0.0057871  0.9942296  0.0846630 -0.068 0.945503
V97  -0.0037213  0.9962856  0.0850356 -0.044 0.965094
V98   0.0167808  1.0169224  0.0827861  0.203 0.839369
V99  -0.1133870  0.8928051  0.0840372 -1.349 0.177257
V100  0.2673911  1.3065514  0.0863100  3.098 0.001948
V101 -0.1594188  0.8526392  0.0845023 -1.887 0.059219
V102  0.0762819  1.0792668  0.0829208  0.920 0.357606
V103 -0.0036048  0.9964016  0.0840522 -0.043 0.965791
V104  0.0390891  1.0398632  0.0829131  0.471 0.637322
V105 -0.2400401  0.7865963  0.0849122 -2.827 0.004700
V106  0.2281063  1.2562188  0.0841038  2.712 0.006684
V107  0.0087734  1.0088120  0.0873360  0.100 0.919982
V108  0.0080907  1.0081235  0.0840137  0.096 0.923281
V109 -0.1693014  0.8442544  0.0839635 -2.016 0.043762
V110  0.0882410  1.0922513  0.0855654  1.031 0.302414
V111 -0.1023574  0.9027069  0.0856514 -1.195 0.232069
V112  0.1599136  1.1734094  0.0845438  1.891 0.058559
V113  0.0609890  1.0628872  0.0847685  0.719 0.471847
V114  0.1446621  1.1556490  0.0854424  1.693 0.090437
V115 -0.2103270  0.8103192  0.0811626 -2.591 0.009558
V116  0.1663666  1.1810059  0.0836255  1.989 0.046654
V117 -0.0586392  0.9430469  0.0836938 -0.701 0.483528
V118  0.0570307  1.0586883  0.0837915  0.681 0.496108
V119  0.0759600  1.0789195  0.0830801  0.914 0.360560
V120  0.1536274  1.1660564  0.0844711  1.819 0.068958
V121 -0.0873896  0.9163200  0.0859186 -1.017 0.309096
V122  0.1180612  1.1253129  0.0867818  1.360 0.173692
V123  0.0636690  1.0657396  0.0861071  0.739 0.459654
V124 -0.2164929  0.8053383  0.0828623 -2.613 0.008983
V125  0.1002267  1.1054215  0.0866660  1.156 0.247489
V126  0.0193491  1.0195376  0.0853871  0.227 0.820731
V127  0.4245399  1.5288869  0.0854719  4.967 6.80e-07
V128  0.3786220  1.4602709  0.0846222  4.474 7.67e-06
V129 -0.1534514  0.8577424  0.0859909 -1.785 0.074341
V130  0.2289733  1.2573085  0.0840392  2.725 0.006438
V131  0.1803898  1.1976842  0.0848877  2.125 0.033583
V132 -0.0464125  0.9546481  0.0839830 -0.553 0.580509
V133  0.0452568  1.0462965  0.0812987  0.557 0.577751
V134 -0.0756405  0.9271494  0.0860847 -0.879 0.379577
V135 -0.0131367  0.9869492  0.0841885 -0.156 0.876002
V136 -0.0295872  0.9708462  0.0817801 -0.362 0.717509
V137 -0.0501317  0.9511041  0.0819621 -0.612 0.540773
V138  0.0832597  1.0868240  0.0848792  0.981 0.326633
V139  0.1703282  1.1856939  0.0833443  2.044 0.040986
V140  0.3139525  1.3688247  0.0838008  3.746 0.000179
V141 -0.0009805  0.9990200  0.0844841 -0.012 0.990740
V142 -0.0496948  0.9515198  0.0836911 -0.594 0.552654
V143  0.2251140  1.2524654  0.0871611  2.583 0.009802
V144 -0.0158252  0.9842994  0.0862949 -0.183 0.854496
V145  0.2018532  1.2236684  0.0847107  2.383 0.017179
V146 -0.0807279  0.9224446  0.0863636 -0.935 0.349920
V147 -0.1137336  0.8924957  0.0828352 -1.373 0.169749
V148 -0.0497820  0.9514369  0.0854720 -0.582 0.560273
V149  0.2231541  1.2500132  0.0848732  2.629 0.008557
V150 -0.1471239  0.8631870  0.0836353 -1.759 0.078558
V151 -0.0373785  0.9633115  0.0856759 -0.436 0.662635
V152 -0.0699428  0.9324471  0.0853988 -0.819 0.412778
V153  0.0856027  1.0893735  0.0835854  1.024 0.305771
V154 -0.1400076  0.8693516  0.0863777 -1.621 0.105044
V155  0.2957059  1.3440748  0.0846122  3.495 0.000474
V156 -0.1758926  0.8387081  0.0830312 -2.118 0.034142
V157  0.0137236  1.0138182  0.0851866  0.161 0.872014
V158 -0.0826339  0.9206882  0.0848058 -0.974 0.329863
V159  0.0358839  1.0365355  0.0817198  0.439 0.660583
V160  0.0597731  1.0615957  0.0854582  0.699 0.484275
V161  0.0292483  1.0296802  0.0836749  0.350 0.726679
V162 -0.1571489  0.8545768  0.0837759 -1.876 0.060679
V163 -0.0047166  0.9952945  0.0844080 -0.056 0.955439
V164 -0.1721635  0.8418415  0.0856224 -2.011 0.044354
V165  0.3926218  1.4808582  0.0829977  4.731 2.24e-06
V166  0.1211403  1.1287833  0.0852377  1.421 0.155257
V167  0.1944321  1.2146210  0.0856207  2.271 0.023156
V168  0.0906548  1.0948910  0.0855188  1.060 0.289119
V169 -0.1067744  0.8987284  0.0825291 -1.294 0.195742
V170 -0.0792935  0.9237687  0.0852183 -0.930 0.352125
V171 -0.0330605  0.9674801  0.0848120 -0.390 0.696678
V172 -0.0060884  0.9939301  0.0839964 -0.072 0.942217
V173 -0.0130213  0.9870631  0.0859523 -0.151 0.879585
V174  0.2040660  1.2263791  0.0841461  2.425 0.015303
V175  0.1485215  1.1601177  0.0858097  1.731 0.083483
V176  0.1939011  1.2139762  0.0855668  2.266 0.023447
V177 -0.0677650  0.9344801  0.0842531 -0.804 0.421222
V178 -0.0791445  0.9239064  0.0830965 -0.952 0.340873
V179  0.1146784  1.1215127  0.0876650  1.308 0.190824
V180 -0.1565588  0.8550813  0.0857605 -1.826 0.067920
V181 -0.1838288  0.8320782  0.0875280 -2.100 0.035709
V182 -0.0793758  0.9236927  0.0842282 -0.942 0.345993
V183 -0.0228126  0.9774456  0.0833171 -0.274 0.784235
V184  0.1280020  1.1365553  0.0804727  1.591 0.111694
V185  0.2713933  1.3117910  0.0840059  3.231 0.001235
V186  0.0455718  1.0466261  0.0847752  0.538 0.590881
V187  0.1577598  1.1708849  0.0840427  1.877 0.060499
V188 -0.1552445  0.8562058  0.0840062 -1.848 0.064601
V189  0.0404336  1.0412622  0.0851716  0.475 0.634978
V190 -0.2387678  0.7875977  0.0848221 -2.815 0.004879
V191 -0.1191870  0.8876418  0.0873560 -1.364 0.172447
V192  0.0189360  1.0191164  0.0881187  0.215 0.829851
V193 -0.1087368  0.8969665  0.0836725 -1.300 0.193754
V194 -0.1777097  0.8371854  0.0844269 -2.105 0.035301
V195 -0.0169348  0.9832078  0.0822208 -0.206 0.836816
V196 -0.0579691  0.9436791  0.0822044 -0.705 0.480697
V197 -0.2243788  0.7990124  0.0846281 -2.651 0.008017
V198  0.0963215  1.1011130  0.0850000  1.133 0.257133
V199 -0.0673742  0.9348453  0.0832658 -0.809 0.418431
V200  0.1152406  1.1221434  0.0851836  1.353 0.176104
 [ reached getOption("max.print") -- omitted 300 rows ]

Likelihood ratio test=1296  on 500 df, p=< 2.2e-16
n= 2000, number of events= 1015 
head(sim_genepanel[[1]]$betas, 5)
            [,1]
[1,]  0.12900508
[2,] -0.14028536
[3,] -0.08893672
[4,] -0.05643506
[5,]  0.03252168

More realistic binary data matrix

This is very similar to before, but the variables have a more realistic frequency distribution (ranging from 5% to 40% positivity). We’re also cutting down the sample and variable counts a bit, because the previous models were very large and took far too long to fit.

Simulation Parameter    Value
Sample number           1000
Case balance            Balanced (50/50)
Duration                1825 days (5 years)
Variable type           Binary
Variable number         300
Percentage inform       Norm Dist (Random)

set.seed(12345)
realistic_panel_input <- binary_generator(samp_num = 1000, var_num = 300, var_freq = c(5,40))
binary_generator: Variable frequency mode. Pos freq = 5:40
head(realistic_panel_input)

Generate 5 simulated survival datasets

sim_realpanel <- sim.survdata(T=1825, num.data.frames = 5, censor=0.5, X=realistic_panel_input, knots=37)

Diagnostics for simulated data

survsim.plot(sim_realpanel, df=1)
survsim.plot(sim_realpanel, df=2)
survsim.plot(sim_realpanel, df=3)
survsim.plot(sim_realpanel, df=4)
survsim.plot(sim_realpanel, df=5)

(Each call draws a two-panel diagnostic figure; figures not shown here.)

coxph(Surv(y, failed) ~ ., data=sim_realpanel[[1]]$data)
Call:
coxph(formula = Surv(y, failed) ~ ., data = sim_realpanel[[1]]$data)

                coef  exp(coef)   se(coef)      z        p
F1_0.18    0.1557154  1.1684936  0.1698253  0.917 0.359187
F2_0.2    -0.1897389  0.8271751  0.1588891 -1.194 0.232416
F3_0.3     0.3415119  1.4070733  0.1435623  2.379 0.017367
F4_0.32    0.2474338  1.2807346  0.1372337  1.803 0.071387
F5_0.28   -0.1544709  0.8568684  0.1433447 -1.078 0.281204
F6_0.33   -0.0327933  0.9677386  0.1371943 -0.239 0.811084
F7_0.15   -0.1448133  0.8651838  0.1872933 -0.773 0.439410
F8_0.36    0.1399326  1.1501963  0.1351874  1.035 0.300622
F9_0.28    0.0437893  1.0447622  0.1452530  0.301 0.763057
F10_0.06  -0.2422081  0.7848929  0.2866873 -0.845 0.398194
F11_0.26  -0.1773896  0.8374535  0.1509828 -1.175 0.240035
F12_0.15   0.0330266  1.0335781  0.1726732  0.191 0.848317
F13_0.34   0.0657378  1.0679467  0.1356294  0.485 0.627898
F14_0.14  -0.2272310  0.7967367  0.1873340 -1.213 0.225140
F15_0.21  -0.1152507  0.8911427  0.1576724 -0.731 0.464809
F16_0.36   0.0582195  1.0599476  0.1326131  0.439 0.660649
F17_0.34   0.1820919  1.1997244  0.1388719  1.311 0.189783
F18_0.05  -0.8507291  0.4271034  0.3061126 -2.779 0.005450
F19_0.16  -0.1413055  0.8682241  0.1713304 -0.825 0.409511
F20_0.24  -0.1764784  0.8382169  0.1567818 -1.126 0.260322
F21_0.12   0.1199804  1.1274748  0.2094325  0.573 0.566724
F22_0.16  -0.4260908  0.6530570  0.1919419 -2.220 0.026426
F23_0.07   0.2055792  1.2282362  0.2618763  0.785 0.432439
F24_0.13   0.2106564  1.2344881  0.1782716  1.182 0.237341
F25_0.18   0.0003643  1.0003644  0.1552334  0.002 0.998128
F26_0.17   0.0100692  1.0101201  0.1740955  0.058 0.953878
F27_0.24   0.2818646  1.3255993  0.1478762  1.906 0.056639
F28_0.2    0.4704854  1.6007710  0.1636763  2.874 0.004047
F29_0.2    0.1243756  1.1324411  0.1616035  0.770 0.441517
F30_0.36   0.0786871  1.0818658  0.1306787  0.602 0.547080
F31_0.38   0.0777798  1.0808846  0.1340774  0.580 0.561840
F32_0.36  -0.1725709  0.8414986  0.1327804 -1.300 0.193714
F33_0.29   0.2609918  1.2982170  0.1352679  1.929 0.053676
F34_0.4   -0.4453703  0.6405870  0.1361804 -3.270 0.001074
F35_0.15   0.1650242  1.1794216  0.1874068  0.881 0.378552
F36_0.13  -0.6233559  0.5361422  0.1975725 -3.155 0.001605
F37_0.09  -0.3331133  0.7166890  0.2251116 -1.480 0.138935
F38_0.19   0.0152350  1.0153516  0.1663606  0.092 0.927033
F39_0.23  -0.4706736  0.6245814  0.1597117 -2.947 0.003209
F40_0.21  -0.0962869  0.9082034  0.1658571 -0.581 0.561550
F41_0.21  -0.4650125  0.6281272  0.1578773 -2.945 0.003225
F42_0.07   0.1901788  1.2094659  0.2519709  0.755 0.450390
F43_0.23  -0.2475142  0.7807391  0.1566230 -1.580 0.114034
F44_0.17   0.1109510  1.1173402  0.1784757  0.622 0.534166
F45_0.35   0.1510288  1.1630301  0.1334219  1.132 0.257649
F46_0.07  -0.6133258  0.5415468  0.2505620 -2.448 0.014373
F47_0.14  -0.3716509  0.6895949  0.1961899 -1.894 0.058179
F48_0.31  -0.2399444  0.7866716  0.1393748 -1.722 0.085146
F49_0.38  -0.0614981  0.9403547  0.1338894 -0.459 0.646004
F50_0.39  -0.4029678  0.6683336  0.1304171 -3.090 0.002003
F51_0.26  -0.1579695  0.8538758  0.1487119 -1.062 0.288121
F52_0.27   0.1127987  1.1194065  0.1476147  0.764 0.444782
F53_0.35   0.0277627  1.0281516  0.1305218  0.213 0.831557
F54_0.3    0.1225860  1.1304163  0.1378271  0.889 0.373778
F55_0.11   0.1341277  1.1435388  0.2007594  0.668 0.504069
F56_0.19   0.1875118  1.2062444  0.1553544  1.207 0.227435
F57_0.27  -0.2140184  0.8073335  0.1448597 -1.477 0.139563
F58_0.16  -0.4074865  0.6653204  0.1732724 -2.352 0.018687
F59_0.3    0.0664910  1.0687513  0.1425610  0.466 0.640927
F60_0.28   0.1981548  1.2191511  0.1425443  1.390 0.164490
F61_0.39  -0.2253857  0.7982083  0.1358108 -1.660 0.097004
F62_0.09   0.3763487  1.4569550  0.2236130  1.683 0.092368
F63_0.08  -0.8360548  0.4334171  0.2542314 -3.289 0.001007
F64_0.15  -0.1415383  0.8680219  0.1812661 -0.781 0.434901
F65_0.07   0.3304994  1.3916630  0.2582549  1.280 0.200636
F66_0.14  -0.1224605  0.8847408  0.1818241 -0.674 0.500622
F67_0.19  -0.0946505  0.9096908  0.1721936 -0.550 0.582542
F68_0.11   0.4167288  1.5169910  0.1900824  2.192 0.028354
F69_0.3   -0.0825151  0.9207976  0.1335251 -0.618 0.536592
F70_0.08   0.2042649  1.2266230  0.2147985  0.951 0.341625
F71_0.32   0.2145501  1.2393042  0.1358253  1.580 0.114198
F72_0.08  -0.2667601  0.7658568  0.2352352 -1.134 0.256789
F73_0.36   0.2041283  1.2264554  0.1335812  1.528 0.126482
F74_0.22  -0.0958786  0.9085743  0.1662713 -0.577 0.564183
F75_0.05   0.2068645  1.2298160  0.2748342  0.753 0.451637
F76_0.32  -0.2721859  0.7617127  0.1413277 -1.926 0.054114
F77_0.13  -0.0419572  0.9589108  0.1999340 -0.210 0.833781
F78_0.26  -0.1263690  0.8812896  0.1458496 -0.866 0.386252
F79_0.21   0.3088150  1.3618104  0.1654479  1.867 0.061966
F80_0.07   0.1529990  1.1653238  0.2428018  0.630 0.528603
F81_0.22   0.3669439  1.4433169  0.1569603  2.338 0.019397
F82_0.09  -0.0546563  0.9468105  0.2433897 -0.225 0.822319
F83_0.14  -0.2324435  0.7925945  0.1910262 -1.217 0.223675
F84_0.11   0.0360131  1.0366694  0.1903200  0.189 0.849917
F85_0.35   0.2659318  1.3046461  0.1357461  1.959 0.050108
F86_0.14  -0.0991489  0.9056079  0.1836345 -0.540 0.589249
F87_0.24   0.1588075  1.1721123  0.1517640  1.046 0.295371
F88_0.16  -0.3196894  0.7263746  0.1758968 -1.817 0.069143
F89_0.38   0.1572082  1.1702393  0.1338867  1.174 0.240319
F90_0.08   0.3145218  1.3696042  0.2093215  1.503 0.132948
F91_0.39   0.0034904  1.0034965  0.1295632  0.027 0.978508
F92_0.1   -0.1700078  0.8436583  0.2292916 -0.741 0.458422
F93_0.27  -0.0838073  0.9196084  0.1564016 -0.536 0.592064
F94_0.08  -0.9456150  0.3884406  0.2705040 -3.496 0.000473
F95_0.34  -0.0583252  0.9433431  0.1424509 -0.409 0.682216
F96_0.35  -0.3178441  0.7277163  0.1384793 -2.295 0.021719
F97_0.39  -0.2480746  0.7803017  0.1320034 -1.879 0.060203
F98_0.33   0.1227886  1.1306454  0.1347949  0.911 0.362333
F99_0.29   0.1115660  1.1180275  0.1328824  0.840 0.401141
F100_0.29  0.1692075  1.1843659  0.1429301  1.184 0.236473
F101_0.32 -0.0417676  0.9590927  0.1350317 -0.309 0.757081
F102_0.16 -0.2553683  0.7746311  0.1825445 -1.399 0.161832
F103_0.32 -0.0280939  0.9722971  0.1410200 -0.199 0.842091
F104_0.4  -0.1806977  0.8346877  0.1357645 -1.331 0.183201
F105_0.12 -0.4677713  0.6263968  0.2140816 -2.185 0.028888
F106_0.18  0.0037064  1.0037133  0.1603822  0.023 0.981563
F107_0.22 -0.4649510  0.6281659  0.1599679 -2.907 0.003655
F108_0.24 -0.0014240  0.9985770  0.1506830 -0.009 0.992460
F109_0.25  0.0208075  1.0210255  0.1546212  0.135 0.892951
F110_0.25 -0.0563127  0.9452435  0.1461418 -0.385 0.699994
F111_0.17  0.6059888  1.8330639  0.1739532  3.484 0.000495
F112_0.15  0.6649368  1.9443677  0.1867011  3.562 0.000369
F113_0.11 -0.2462548  0.7817230  0.2130412 -1.156 0.247721
F114_0.28 -0.0170717  0.9830732  0.1422405 -0.120 0.904467
F115_0.12  0.1369680  1.1467915  0.2050164  0.668 0.504080
F116_0.24 -0.0951796  0.9092096  0.1493523 -0.637 0.523941
F117_0.15  0.2927857  1.3401556  0.1900345  1.541 0.123390
F118_0.25  0.0551323  1.0566804  0.1483540  0.372 0.710171
F119_0.13  0.1407642  1.1511531  0.2109508  0.667 0.504591
F120_0.4  -0.1230193  0.8842466  0.1334087 -0.922 0.356464
F121_0.14 -0.2108397  0.8099039  0.1900771 -1.109 0.267330
F122_0.11 -0.3249386  0.7225717  0.2041430 -1.592 0.111447
F123_0.17  0.2913250  1.3381994  0.1748097  1.667 0.095609
F124_0.38 -0.0299068  0.9705360  0.1306872 -0.229 0.818991
F125_0.22 -0.0345144  0.9660744  0.1550249 -0.223 0.823817
F126_0.05  0.0103440  1.0103977  0.3252484  0.032 0.974629
F127_0.22 -0.1521922  0.8588232  0.1469791 -1.035 0.300450
F128_0.33  0.2218169  1.2483427  0.1364213  1.626 0.103956
F129_0.09 -0.0267972  0.9735587  0.2408961 -0.111 0.911426
F130_0.17  0.5428755  1.7209484  0.1723857  3.149 0.001637
F131_0.19 -0.2472596  0.7809380  0.1697225 -1.457 0.145159
F132_0.19  0.2343421  1.2640768  0.1568001  1.495 0.135038
F133_0.37 -0.5614845  0.5703618  0.1349849 -4.160 3.19e-05
F134_0.3   0.1813941  1.1988876  0.1357341  1.336 0.181421
F135_0.09  0.1757527  1.1921432  0.2422288  0.726 0.468106
F136_0.18 -0.0229950  0.9772674  0.1777442 -0.129 0.897064
F137_0.15  0.5013162  1.6508928  0.1806433  2.775 0.005517
F138_0.07 -0.5785196  0.5607278  0.2782055 -2.079 0.037574
F139_0.35  0.2794563  1.3224106  0.1379184  2.026 0.042740
F140_0.12 -0.1528387  0.8582681  0.1935676 -0.790 0.429768
F141_0.26 -0.7365038  0.4787849  0.1529247 -4.816 1.46e-06
F142_0.06 -0.1779801  0.8369591  0.2825744 -0.630 0.528791
F143_0.05  0.9857410  2.6797970  0.2703415  3.646 0.000266
F144_0.26  0.3043254  1.3557101  0.1392142  2.186 0.028814
F145_0.14  0.0408814  1.0417285  0.1908208  0.214 0.830360
F146_0.17  0.2059303  1.2286676  0.1702150  1.210 0.226346
F147_0.28  0.2849560  1.3297035  0.1388933  2.052 0.040207
F148_0.27 -0.2124739  0.8085814  0.1459212 -1.456 0.145369
F149_0.19 -0.1404566  0.8689613  0.1669680 -0.841 0.400225
F150_0.14 -0.3510181  0.7039710  0.1843087 -1.905 0.056844
F151_0.22 -0.0091363  0.9909053  0.1505646 -0.061 0.951614
F152_0.34  0.0304809  1.0309502  0.1344152  0.227 0.820605
F153_0.21  0.1044184  1.1100648  0.1567206  0.666 0.505238
F154_0.08 -0.2694443  0.7638038  0.2262449 -1.191 0.233677
F155_0.38  0.1320124  1.1411225  0.1330314  0.992 0.321032
F156_0.4   0.1066147  1.1125055  0.1293691  0.824 0.409876
F157_0.07 -0.0355382  0.9650859  0.2436756 -0.146 0.884046
F158_0.07  0.1835878  1.2015204  0.2574866  0.713 0.475846
F159_0.35  0.0828244  1.0863510  0.1351579  0.613 0.540010
F160_0.13  0.7447921  2.1060035  0.1997289  3.729 0.000192
F161_0.16 -0.0050454  0.9949673  0.1804992 -0.028 0.977700
F162_0.35 -0.3166208  0.7286070  0.1379215 -2.296 0.021695
F163_0.12  0.2287772  1.2570620  0.1889396  1.211 0.225953
F164_0.28 -0.4346645  0.6474818  0.1438272 -3.022 0.002510
F165_0.13  0.0166566  1.0167961  0.1993876  0.084 0.933423
F166_0.36 -0.2955640  0.7441118  0.1398465 -2.113 0.034559
F167_0.19 -0.3087552  0.7343605  0.1666037 -1.853 0.063849
F168_0.06 -0.3154402  0.7294677  0.2780960 -1.134 0.256675
F169_0.24  0.0019321  1.0019340  0.1522391  0.013 0.989874
F170_0.35 -0.1944913  0.8232533  0.1356436 -1.434 0.151618
F171_0.09  0.0736142  1.0763915  0.2182293  0.337 0.735872
F172_0.3   0.0794111  1.0826493  0.1433944  0.554 0.579719
F173_0.27  0.0302218  1.0306831  0.1401160  0.216 0.829228
F174_0.36 -0.0791246  0.9239248  0.1340199 -0.590 0.554926
F175_0.24 -0.2643964  0.7676691  0.1504048 -1.758 0.078765
F176_0.32 -0.0257090  0.9746187  0.1367628 -0.188 0.850890
F177_0.2  -0.6064282  0.5452951  0.1614038 -3.757 0.000172
F178_0.35 -0.1823571  0.8333037  0.1317001 -1.385 0.166163
F179_0.06  0.2249554  1.2522668  0.2391832  0.941 0.346954
F180_0.23 -0.0315793  0.9689141  0.1424071 -0.222 0.824506
F181_0.12  0.4489902  1.5667293  0.2048480  2.192 0.028392
F182_0.22 -0.0019743  0.9980276  0.1536913 -0.013 0.989751
F183_0.06  0.4445222  1.5597448  0.2453785  1.812 0.070052
F184_0.25  0.3438125  1.4103141  0.1428342  2.407 0.016081
F185_0.4  -0.0040093  0.9959987  0.1303591 -0.031 0.975464
F186_0.08 -0.0077660  0.9922641  0.2123115 -0.037 0.970821
F187_0.24  0.0245993  1.0249044  0.1529676  0.161 0.872240
F188_0.1  -0.2899597  0.7482938  0.2200041 -1.318 0.187512
F189_0.31  0.4818045  1.6189933  0.1369824  3.517 0.000436
F190_0.05  0.5382679  1.7130372  0.2658715  2.025 0.042914
F191_0.18  0.0697078  1.0721948  0.1719389  0.405 0.685168
F192_0.11  0.5471973  1.7284020  0.2024457  2.703 0.006873
F193_0.32 -0.4234428  0.6547886  0.1388294 -3.050 0.002288
F194_0.4   0.1014995  1.1068293  0.1351077  0.751 0.452503
F195_0.21 -0.2124599  0.8085927  0.1578564 -1.346 0.178333
F196_0.37  0.1975686  1.2184366  0.1296740  1.524 0.127614
F197_0.35  0.1595770  1.1730146  0.1383014  1.154 0.248568
F198_0.25  0.4123771  1.5104038  0.1408899  2.927 0.003423
F199_0.07  0.6804522  1.9747706  0.2649096  2.569 0.010210
F200_0.11  0.5143577  1.6725638  0.1992710  2.581 0.009846
 [ reached getOption("max.print") -- omitted 100 rows ]

Likelihood ratio test=560.4  on 300 df, p=< 2.2e-16
n= 1000, number of events= 493 
head(sim_realpanel[[1]]$betas, 5)
            [,1]
[1,]  0.07153690
[2,] -0.15811752
[3,]  0.09630148
[4,]  0.13092209
[5,]  0.07115639
ggplot() + geom_histogram(aes(x=sim_realpanel[[1]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated real panel")

VAF-style panel data

This is derived wholly from the “realistic” panel data, but each mutated entry is replaced with a variant allele frequency (VAF) instead of a binary indicator. The VAFs are drawn at random between a floor of 0.05, roughly the minimum VAF that most variant callers can reliably detect, and a cap of 1 (a homozygous germline variant).
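The per-element scaling used below can be sanity-checked on a toy matrix (a minimal base-R sketch; the `toy` matrix is a made-up stand-in for the real panel input):

```r
# Toy binary mutation matrix: 4 samples x 3 genes
set.seed(42)
toy <- matrix(rbinom(12, 1, 0.5), nrow = 4)

# Replace each mutated (1) entry with a random VAF drawn from
# {0.05, 0.06, ..., 1.00}; unmutated (0) entries remain 0
toy_vaf <- apply(toy, 1:2, function(x) x * (sample(5:100, 1) / 100))

# Every nonzero entry respects the 0.05 floor and the cap of 1
nz <- toy_vaf[toy_vaf > 0]
all(nz >= 0.05 & nz <= 1)
```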

real_vaf_panel_input <- realistic_panel_input
# Replace each mutated (1) entry with a random VAF in [0.05, 1]
real_vaf_panel_input <- apply(real_vaf_panel_input, 1:2, function(x) x*(sample(5:100,1)/100))

Generate 5 simulated survival datasets

sim_realVAF_panel <- sim.survdata(T=1825, num.data.frames = 5, censor=0.5, X=real_vaf_panel_input, knots=37)

Diagnostics for simulated data

survsim.plot(sim_realVAF_panel, df=1)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_realVAF_panel, df=2)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_realVAF_panel, df=3)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_realVAF_panel, df=4)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_realVAF_panel, df=5)
TableGrob (2 x 1) "arrange": 2 grobs

coxph(Surv(y, failed) ~ ., data=sim_realVAF_panel[[1]]$data)
Call:
coxph(formula = Surv(y, failed) ~ ., data = sim_realVAF_panel[[1]]$data)

               coef exp(coef)  se(coef)      z        p
F1_0.18    0.269149  1.308850  0.258994  1.039 0.298708
F2_0.2     0.115265  1.122171  0.285081  0.404 0.685973
F3_0.3    -0.315720  0.729263  0.250716 -1.259 0.207931
F4_0.32    0.059861  1.061689  0.224177  0.267 0.789450
F5_0.28   -0.666518  0.513494  0.243768 -2.734 0.006253
F6_0.33   -0.185656  0.830559  0.217108 -0.855 0.392478
F7_0.15   -0.195507  0.822417  0.281667 -0.694 0.487615
F8_0.36   -0.719152  0.487165  0.233220 -3.084 0.002045
F9_0.28    0.457967  1.580857  0.250359  1.829 0.067364
F10_0.06  -0.865349  0.420905  0.529940 -1.633 0.102486
F11_0.26   0.212835  1.237180  0.257329  0.827 0.408185
F12_0.15   0.290361  1.336910  0.279136  1.040 0.298240
F13_0.34  -0.205359  0.814355  0.226333 -0.907 0.364231
F14_0.14  -0.238438  0.787858  0.295759 -0.806 0.420134
F15_0.21  -0.281593  0.754581  0.273107 -1.031 0.302507
F16_0.36  -0.252742  0.776668  0.231182 -1.093 0.274280
F17_0.34   0.046317  1.047406  0.215231  0.215 0.829614
F18_0.05  -0.277314  0.757816  0.496461 -0.559 0.576446
F19_0.16   0.484058  1.622646  0.308592  1.569 0.116741
F20_0.24   0.070272  1.072799  0.250332  0.281 0.778930
F21_0.12  -0.159550  0.852528  0.381643 -0.418 0.675903
F22_0.16  -0.725423  0.484120  0.350652 -2.069 0.038567
F23_0.07  -0.577652  0.561214  0.480629 -1.202 0.229415
F24_0.13  -0.870451  0.418763  0.332745 -2.616 0.008897
F25_0.18   0.361466  1.435432  0.269082  1.343 0.179165
F26_0.17   0.359587  1.432738  0.306176  1.174 0.240217
F27_0.24  -0.626715  0.534344  0.249133 -2.516 0.011883
F28_0.2    0.223461  1.250397  0.263193  0.849 0.395860
F29_0.2    0.853907  2.348805  0.282794  3.020 0.002532
F30_0.36  -0.200076  0.818668  0.219061 -0.913 0.361065
F31_0.38  -0.114763  0.891578  0.209075 -0.549 0.583070
F32_0.36   0.529234  1.697631  0.212572  2.490 0.012786
F33_0.29  -0.568575  0.566332  0.230758 -2.464 0.013742
F34_0.4    0.145303  1.156389  0.221536  0.656 0.511898
F35_0.15  -0.176038  0.838587  0.291139 -0.605 0.545410
F36_0.13   0.196230  1.216807  0.326099  0.602 0.547339
F37_0.09   0.555160  1.742220  0.345023  1.609 0.107605
F38_0.19   0.343534  1.409921  0.259773  1.322 0.186022
F39_0.23   0.211935  1.236067  0.256338  0.827 0.408363
F40_0.21   0.037852  1.038578  0.240165  0.158 0.874764
F41_0.21   0.462497  1.588034  0.272079  1.700 0.089157
F42_0.07   0.370491  1.448445  0.427582  0.866 0.386228
F43_0.23  -0.371277  0.689853  0.270799 -1.371 0.170361
F44_0.17   0.559725  1.750190  0.298850  1.873 0.061078
F45_0.35  -0.022051  0.978190  0.210890 -0.105 0.916723
F46_0.07   0.052915  1.054341  0.376567  0.141 0.888248
F47_0.14  -0.061682  0.940182  0.332118 -0.186 0.852663
F48_0.31   0.092323  1.096719  0.234328  0.394 0.693589
F49_0.38  -0.422716  0.655265  0.210163 -2.011 0.044287
F50_0.39   0.153286  1.165658  0.194942  0.786 0.431683
F51_0.26   0.190220  1.209515  0.250311  0.760 0.447295
F52_0.27   0.138057  1.148041  0.231342  0.597 0.550665
F53_0.35  -0.154511  0.856834  0.221521 -0.698 0.485490
F54_0.3    0.178270  1.195148  0.232741  0.766 0.443700
F55_0.11   0.134072  1.143475  0.312633  0.429 0.668034
F56_0.19   0.661370  1.937444  0.269249  2.456 0.014036
F57_0.27  -0.094204  0.910097  0.261973 -0.360 0.719150
F58_0.16   0.536179  1.709463  0.271502  1.975 0.048283
F59_0.3   -0.453586  0.635346  0.241195 -1.881 0.060029
F60_0.28   0.664006  1.942559  0.239627  2.771 0.005589
F61_0.39   0.414736  1.513970  0.209627  1.978 0.047879
F62_0.09   0.435787  1.546180  0.383209  1.137 0.255453
F63_0.08   0.837436  2.310435  0.425984  1.966 0.049312
F64_0.15   0.469186  1.598692  0.309138  1.518 0.129085
F65_0.07  -0.305757  0.736566  0.439490 -0.696 0.486612
F66_0.14   0.818830  2.267846  0.303133  2.701 0.006908
F67_0.19   0.412016  1.509858  0.274222  1.502 0.132970
F68_0.11  -0.336596  0.714197  0.316771 -1.063 0.287970
F69_0.3    0.258895  1.295498  0.219800  1.178 0.238849
F70_0.08   0.517346  1.677569  0.395340  1.309 0.190666
F71_0.32   0.478017  1.612873  0.231072  2.069 0.038575
F72_0.08   0.726766  2.068381  0.359403  2.022 0.043161
F73_0.36   0.564734  1.758980  0.200697  2.814 0.004895
F74_0.22   0.368016  1.444865  0.272708  1.349 0.177180
F75_0.05   0.567915  1.764584  0.582885  0.974 0.329899
F76_0.32   0.245642  1.278442  0.226532  1.084 0.278205
F77_0.13   0.480343  1.616629  0.340515  1.411 0.158352
F78_0.26  -0.078280  0.924706  0.245893 -0.318 0.750220
F79_0.21  -0.046717  0.954358  0.275692 -0.169 0.865441
F80_0.07   0.466689  1.594706  0.470948  0.991 0.321706
F81_0.22  -0.341212  0.710908  0.258564 -1.320 0.186954
F82_0.09  -1.191112  0.303883  0.429947 -2.770 0.005599
F83_0.14  -0.448426  0.638633  0.311034 -1.442 0.149380
F84_0.11   0.647340  1.910453  0.296197  2.186 0.028852
F85_0.35  -0.147616  0.862763  0.210907 -0.700 0.483985
F86_0.14   0.278787  1.321526  0.290759  0.959 0.337647
F87_0.24   0.397139  1.487562  0.258890  1.534 0.125028
F88_0.16   0.577244  1.781123  0.289249  1.996 0.045970
F89_0.38   0.286149  1.331291  0.216971  1.319 0.187224
F90_0.08   0.468954  1.598321  0.353877  1.325 0.185109
F91_0.39  -0.037263  0.963422  0.209115 -0.178 0.858569
F92_0.1   -0.613119  0.541659  0.347185 -1.766 0.077400
F93_0.27  -0.425118  0.653693  0.236066 -1.801 0.071728
F94_0.08  -0.085695  0.917874  0.414872 -0.207 0.836355
F95_0.34  -0.183756  0.832139  0.221345 -0.830 0.406436
F96_0.35   0.166575  1.181252  0.234270  0.711 0.477060
F97_0.39   0.299728  1.349491  0.203339  1.474 0.140473
F98_0.33  -0.091698  0.912381  0.230570 -0.398 0.690850
F99_0.29   0.649519  1.914620  0.235225  2.761 0.005758
F100_0.29 -0.015762  0.984361  0.236960 -0.067 0.946964
F101_0.32  0.336293  1.399749  0.216661  1.552 0.120623
F102_0.16  0.247747  1.281135  0.294027  0.843 0.399453
F103_0.32 -0.448650  0.638490  0.252858 -1.774 0.076011
F104_0.4   0.205904  1.228635  0.224012  0.919 0.358010
F105_0.12  0.232232  1.261412  0.357367  0.650 0.515795
F106_0.18 -0.816524  0.441965  0.273401 -2.987 0.002821
F107_0.22 -0.632996  0.530999  0.291098 -2.175 0.029667
F108_0.24 -0.397622  0.671916  0.255670 -1.555 0.119894
F109_0.25 -0.245616  0.782223  0.266625 -0.921 0.356944
F110_0.25 -0.156525  0.855110  0.238562 -0.656 0.511747
F111_0.17  0.459517  1.583309  0.283828  1.619 0.105447
F112_0.15  0.047562  1.048712  0.303953  0.156 0.875655
F113_0.11  0.144594  1.155570  0.345912  0.418 0.675943
F114_0.28 -0.459162  0.631813  0.250764 -1.831 0.067093
F115_0.12  0.359379  1.432439  0.341658  1.052 0.292861
F116_0.24  0.083694  1.087296  0.237425  0.353 0.724458
F117_0.15  0.019717  1.019913  0.297884  0.066 0.947226
F118_0.25 -0.207434  0.812667  0.251541 -0.825 0.409569
F119_0.13  1.026632  2.791647  0.313864  3.271 0.001072
F120_0.4   0.273204  1.314169  0.202852  1.347 0.178038
F121_0.14 -1.018229  0.361234  0.318331 -3.199 0.001381
F122_0.11 -0.270799  0.762770  0.358957 -0.754 0.450606
F123_0.17 -0.356485  0.700133  0.289192 -1.233 0.217690
F124_0.38  0.480933  1.617583  0.210111  2.289 0.022083
F125_0.22 -0.086621  0.917024  0.259628 -0.334 0.738654
F126_0.05 -0.158076  0.853785  0.537690 -0.294 0.768765
F127_0.22 -0.477218  0.620507  0.257153 -1.856 0.063486
F128_0.33  0.026130  1.026475  0.212325  0.123 0.902053
F129_0.09 -0.369884  0.690814  0.414527 -0.892 0.372230
F130_0.17  0.025860  1.026198  0.286019  0.090 0.927958
F131_0.19 -0.199934  0.818785  0.282718 -0.707 0.479451
F132_0.19 -0.697687  0.497735  0.263449 -2.648 0.008090
F133_0.37  0.069696  1.072182  0.221747  0.314 0.753290
F134_0.3   0.236231  1.266467  0.220826  1.070 0.284728
F135_0.09  0.158078  1.171257  0.403578  0.392 0.695287
F136_0.18 -0.390367  0.676808  0.306328 -1.274 0.202541
F137_0.15  0.451674  1.570941  0.316821  1.426 0.153970
F138_0.07  0.716426  2.047104  0.397089  1.804 0.071201
F139_0.35  0.040006  1.040817  0.215425  0.186 0.852674
F140_0.12 -0.028676  0.971732  0.346142 -0.083 0.933976
F141_0.26  0.339663  1.404474  0.237555  1.430 0.152767
F142_0.06  0.430786  1.538466  0.506805  0.850 0.395323
F143_0.05  0.211967  1.236107  0.484900  0.437 0.662014
F144_0.26  0.200948  1.222561  0.246989  0.814 0.415879
F145_0.14  0.636845  1.890508  0.356124  1.788 0.073733
F146_0.17 -0.350447  0.704373  0.266612 -1.314 0.188696
F147_0.28 -0.368260  0.691937  0.233168 -1.579 0.114250
F148_0.27 -0.301097  0.740006  0.237082 -1.270 0.204081
F149_0.19 -0.266023  0.766422  0.288905 -0.921 0.357157
F150_0.14  0.268648  1.308195  0.280349  0.958 0.337930
F151_0.22  0.494900  1.640335  0.255355  1.938 0.052612
F152_0.34  0.174844  1.191061  0.226303  0.773 0.439753
F153_0.21 -0.547358  0.578476  0.253864 -2.156 0.031075
F154_0.08 -0.065611  0.936495  0.406849 -0.161 0.871883
F155_0.38 -0.215476  0.806158  0.213278 -1.010 0.312347
F156_0.4  -0.349886  0.704768  0.218453 -1.602 0.109231
F157_0.07 -0.564083  0.568881  0.406730 -1.387 0.165480
F158_0.07 -0.831484  0.435403  0.441529 -1.883 0.059674
F159_0.35 -0.141219  0.868299  0.220130 -0.642 0.521183
F160_0.13 -0.108600  0.897089  0.323838 -0.335 0.737359
F161_0.16 -0.431996  0.649212  0.286911 -1.506 0.132150
F162_0.35  0.455401  1.576806  0.213628  2.132 0.033027
F163_0.12 -0.305652  0.736643  0.315069 -0.970 0.331991
F164_0.28  0.005375  1.005390  0.223009  0.024 0.980770
F165_0.13  0.007864  1.007895  0.301879  0.026 0.979216
F166_0.36 -0.526473  0.590685  0.219862 -2.395 0.016640
F167_0.19  0.079538  1.082787  0.271232  0.293 0.769334
F168_0.06 -0.222146  0.800798  0.426262 -0.521 0.602263
F169_0.24  0.299958  1.349802  0.269803  1.112 0.266238
F170_0.35 -0.044208  0.956755  0.234196 -0.189 0.850279
F171_0.09  0.462110  1.587420  0.375359  1.231 0.218279
F172_0.3   0.163416  1.177527  0.224537  0.728 0.466740
F173_0.27  0.681072  1.975994  0.224838  3.029 0.002452
F174_0.36 -0.463013  0.629385  0.225580 -2.053 0.040117
F175_0.24  0.219710  1.245716  0.242936  0.904 0.365786
F176_0.32  0.360595  1.434182  0.239816  1.504 0.132677
F177_0.2   0.248118  1.281611  0.270105  0.919 0.358305
F178_0.35 -0.452427  0.636083  0.218335 -2.072 0.038250
F179_0.06  0.419968  1.521913  0.416174  1.009 0.312919
F180_0.23  0.400805  1.493026  0.257532  1.556 0.119630
F181_0.12 -0.052287  0.949056  0.312402 -0.167 0.867077
F182_0.22  0.153568  1.165987  0.273733  0.561 0.574789
F183_0.06 -0.321239  0.725250  0.560764 -0.573 0.566739
F184_0.25  0.038078  1.038812  0.245938  0.155 0.876958
F185_0.4   0.365473  1.441195  0.216070  1.691 0.090750
F186_0.08 -0.145068  0.864963  0.362545 -0.400 0.689054
F187_0.24  0.345764  1.413070  0.238073  1.452 0.146406
F188_0.1   0.455604  1.577125  0.318895  1.429 0.153092
F189_0.31 -0.047695  0.953425  0.231808 -0.206 0.836984
F190_0.05 -0.697716  0.497721  0.497245 -1.403 0.160569
F191_0.18  0.609457  1.839433  0.259024  2.353 0.018628
F192_0.11 -0.260074  0.770994  0.383402 -0.678 0.497560
F193_0.32  0.522903  1.686917  0.220926  2.367 0.017940
F194_0.4  -0.371667  0.689584  0.223530 -1.663 0.096370
F195_0.21  0.951033  2.588382  0.246940  3.851 0.000118
F196_0.37  0.255872  1.291587  0.215928  1.185 0.236023
F197_0.35  0.242106  1.273929  0.204941  1.181 0.237466
F198_0.25 -0.301815  0.739475  0.249849 -1.208 0.227050
F199_0.07 -1.336017  0.262891  0.478099 -2.794 0.005199
F200_0.11  0.067147  1.069453  0.327448  0.205 0.837524
 [ reached getOption("max.print") -- omitted 100 rows ]

Likelihood ratio test=477.6  on 300 df, p=2.844e-10
n= 1000, number of events= 495 
head(sim_realVAF_panel[[1]]$betas, 10)
               [,1]
 [1,]  0.0575910279
 [2,]  0.0056428074
 [3,] -0.0394145636
 [4,] -0.0844503464
 [5,] -0.1038468228
 [6,] -0.0548150277
 [7,]  0.0000433159
 [8,]  0.0066601553
 [9,]  0.1917508450
[10,]  0.0676687202
ggplot() + geom_histogram(aes(x=sim_realVAF_panel[[1]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated real panel with VAFs")

Reduced VAF panel data based on selecting only variables with “high” coefficients

Even the selected variables still have quite low coefficients, so we may need another approach (e.g., create smaller sets of informative variables and then pad them with completely random ones).

sim_realVAF_panel_reduc <- list()
for (i in 1:5) {
  # Keep only covariates whose 'true' coefficient magnitude exceeds 0.15,
  # then re-attach the outcome columns
  sim_realVAF_panel_reduc[[i]] <- sim_realVAF_panel[[i]]$data[,abs(sim_realVAF_panel[[i]]$betas) > 0.15]
  sim_realVAF_panel_reduc[[i]]$y <- sim_realVAF_panel[[i]]$data$y
  sim_realVAF_panel_reduc[[i]]$failed <- sim_realVAF_panel[[i]]$data$failed
}

Positive control (small number of continuous variables)

Since these seem to behave poorly in the random survival forest, let’s take a different approach and build a minimal positive control that we can then modify to test modeling tolerances.

Generate binary dataset

1000 samples with 10 gene variables should lead to relatively high coefficients.

set.seed(12345)
pos_ctrl_input <- binary_generator(samp_num = 1000, var_num = 10, var_freq = c(5,40))
binary_generator: Variable frequency mode. Pos freq = 5:40
pos_ctrl_input <- apply(pos_ctrl_input, 1:2, function(x) x*(sample(5:100,1)/100))
head(pos_ctrl_input)
     F1_0.18 F2_0.2 F3_0.3 F4_0.32 F5_0.28 F6_0.33 F7_0.15 F8_0.36 F9_0.28 F10_0.06
[1,]       0   0.00   0.00    0.00    0.45    0.00       0    0.00    0.33     0.00
[2,]       0   0.00   0.00    0.71    0.00    0.12       0    0.13    0.45     0.00
[3,]       0   0.74   0.00    0.57    0.00    0.00       0    0.00    0.00     0.96
[4,]       0   0.00   0.00    0.00    0.00    0.00       0    0.82    0.00     0.00
[5,]       0   0.00   0.23    0.00    0.00    0.00       0    0.29    0.59     0.00
[6,]       0   0.00   0.00    0.10    0.00    0.73       0    0.00    0.00     0.00

Generate 5 simulated survival datasets

sim_pos_ctrl <- sim.survdata(T=1825, N=1000, num.data.frames = 5, censor=0.5, knots=37, X=pos_ctrl_input, beta = c(0,-1, 1, -2, 2, -3, 3, -4, 4, 5))
In the five simulated data frames, 129, 174, 213, 172, and 149 observations (respectively) have drawn durations at the minimum or maximum possible value. Generating coefficients and other quantities of interest are unlikely to be returned by models due to truncation. Consider making user-supplied coefficients smaller, making T bigger, or decreasing the variance of the X variables.

Diagnostics for simulated data

survsim.plot(sim_pos_ctrl, df=1)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_pos_ctrl, df=2)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_pos_ctrl, df=3)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_pos_ctrl, df=4)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_pos_ctrl, df=5)
TableGrob (2 x 1) "arrange": 2 grobs

coxph(Surv(y, failed) ~ ., data=sim_pos_ctrl[[1]]$data)
Call:
coxph(formula = Surv(y, failed) ~ ., data = sim_pos_ctrl[[1]]$data)

              coef exp(coef)  se(coef)       z        p
F1_0.18   -0.04555   0.95547   0.20368  -0.224    0.823
F2_0.2    -0.91856   0.39909   0.18814  -4.882 1.05e-06
F3_0.3     0.97739   2.65751   0.17477   5.593 2.24e-08
F4_0.32   -1.82048   0.16195   0.17680 -10.297  < 2e-16
F5_0.28    1.50225   4.49179   0.20952   7.170 7.51e-13
F6_0.33   -2.78313   0.06184   0.19226 -14.476  < 2e-16
F7_0.15    2.46991  11.82133   0.24636  10.026  < 2e-16
F8_0.36   -3.60427   0.02721   0.19976 -18.043  < 2e-16
F9_0.28    3.61584  37.18268   0.22071  16.383  < 2e-16
F10_0.06   4.65468 105.07540   0.34195  13.612  < 2e-16

Likelihood ratio test=676  on 10 df, p=< 2.2e-16
n= 1000, number of events= 505 
head(sim_pos_ctrl[[1]]$betas, 10)
      [,1]
 [1,]    0
 [2,]   -1
 [3,]    1
 [4,]   -2
 [5,]    2
 [6,]   -3
 [7,]    3
 [8,]   -4
 [9,]    4
[10,]    5
ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[1]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")

ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[2]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")

ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[3]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")

ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[4]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")

ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[5]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")

Positive control

Let’s clean up this positive control a little. It makes sense to find the “minimum positive control” we can use: 10 variables seems a decent number, but let’s see how large the coefficients need to be for the simulation to behave well. Maybe we can cut down on the number of aberrant samples at the top end (the truncation warnings thrown by the function).

Generate data

sim_pos_ctrl <- sim.survdata(T=1825, N=1000, num.data.frames = 5, censor=0.5, knots=37, X=pos_ctrl_input, beta = c(0,-0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2))
Three of the five simulated data frames triggered the truncation warning: 15, 57, and 16 observations have drawn durations at the minimum or maximum possible value. Generating coefficients and other quantities of interest are unlikely to be returned by models due to truncation. Consider making user-supplied coefficients smaller, making T bigger, or decreasing the variance of the X variables.

Diagnostics for simulated data

survsim.plot(sim_pos_ctrl, df=1)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_pos_ctrl, df=2)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_pos_ctrl, df=3)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_pos_ctrl, df=4)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(sim_pos_ctrl, df=5)
TableGrob (2 x 1) "arrange": 2 grobs

coxph(Surv(y, failed) ~ ., data=sim_pos_ctrl[[1]]$data)
Call:
coxph(formula = Surv(y, failed) ~ ., data = sim_pos_ctrl[[1]]$data)

             coef exp(coef) se(coef)      z        p
F1_0.18  -0.06903   0.93330  0.20437 -0.338 0.735556
F2_0.2   -0.51872   0.59528  0.18584 -2.791 0.005251
F3_0.3    0.22023   1.24636  0.17289  1.274 0.202737
F4_0.32  -0.52714   0.59029  0.15810 -3.334 0.000856
F5_0.28   0.64265   1.90152  0.17403  3.693 0.000222
F6_0.33  -0.90478   0.40463  0.17442 -5.187 2.13e-07
F7_0.15   0.92346   2.51799  0.21641  4.267 1.98e-05
F8_0.36  -0.78747   0.45500  0.15532 -5.070 3.98e-07
F9_0.28   0.83322   2.30072  0.17657  4.719 2.37e-06
F10_0.06  1.98133   7.25241  0.31279  6.334 2.38e-10

Likelihood ratio test=134.7  on 10 df, p=< 2.2e-16
n= 1000, number of events= 503 
head(sim_pos_ctrl[[1]]$betas, 10)
       [,1]
 [1,]  0.00
 [2,] -0.25
 [3,]  0.25
 [4,] -0.50
 [5,]  0.50
 [6,] -0.75
 [7,]  0.75
 [8,] -1.00
 [9,]  1.00
[10,]  2.00

We have an interesting conundrum here. If we keep the coefficients below 1, we reduce the “blowout” that happens as the survival simulator tries to satisfy the coefficient requirements with zero-inflated data. However, as the coefficients decrease below 1, we also see a correlated loss of performance in the random survival forest.
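To make the trade-off concrete: in a Cox model a coefficient beta corresponds to a hazard ratio exp(beta), so the coefficient magnitudes map directly onto signal strength (a base-R sketch; the interpretation is illustrative, not derived from the simulations above):

```r
# Hazard ratios implied by the user-supplied coefficient magnitudes
betas <- c(0.25, 0.5, 0.75, 1, 2)
round(exp(betas), 2)  # 1.28 1.65 2.12 2.72 7.39
# beta = 0.25 shifts the hazard by only ~28% (a weak signal for the RSF),
# while beta = 2 multiplies it ~7.4-fold, strong enough to push simulated
# durations against the T = 1825 boundary and trigger truncation warnings
```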

Fixed positive control

The variable positive control seems to work okay, but we see an interaction in random survival forest accuracy between variable frequency and coefficient. What we need to do now is generate several datasets with fixed frequency and varying coefficients, essentially a matrix of frequency and coefficient combinations whose performance we can then track. Let’s use 10, 20, 40, 60, 80, and 100 percent mutation prevalence.
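The intended design can be written down as an explicit frequency-by-coefficient grid (a sketch using base R’s `expand.grid`; the blocks below build the same 60 variables one frequency at a time):

```r
# Full crossing of the 6 prevalence levels with the 10 coefficient values
design <- expand.grid(
  coef = c(0, -0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2),
  freq = c(10, 20, 40, 60, 80, 100)
)
nrow(design)  # 60 frequency/coefficient combinations, one per variable
```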

Generate binary dataset

Generate 1000 samples with 60 variables ranging in prevalence from 10% to 100%

set.seed(12345)
prev_coeff_10 <- binary_generator(samp_num = 1000, var_num = 10, var_freq = 10)
prev_coeff_10 <- apply(prev_coeff_10, 1:2, function(x) x*(sample(5:100,1)/100))

prev_coeff_20 <- binary_generator(samp_num = 1000, var_num = 10, var_freq = 20)
prev_coeff_20 <- apply(prev_coeff_20, 1:2, function(x) x*(sample(5:100,1)/100))

prev_coeff_40 <- binary_generator(samp_num = 1000, var_num = 10, var_freq = 40)
prev_coeff_40 <- apply(prev_coeff_40, 1:2, function(x) x*(sample(5:100,1)/100))

prev_coeff_60 <- binary_generator(samp_num = 1000, var_num = 10, var_freq = 60)
prev_coeff_60 <- apply(prev_coeff_60, 1:2, function(x) x*(sample(5:100,1)/100))

prev_coeff_80 <- binary_generator(samp_num = 1000, var_num = 10, var_freq = 80)
prev_coeff_80 <- apply(prev_coeff_80, 1:2, function(x) x*(sample(5:100,1)/100))

prev_coeff_100 <- binary_generator(samp_num = 1000, var_num = 10, var_freq = 100)
prev_coeff_100 <- apply(prev_coeff_100, 1:2, function(x) x*(sample(5:100,1)/100))
freq_coeff_mat_input <- cbind(prev_coeff_10, prev_coeff_20, prev_coeff_40, prev_coeff_60, prev_coeff_80, prev_coeff_100)
freq_coeff_mat_input <- as.data.frame(freq_coeff_mat_input)
# Names encode frequency (F) and coefficient (C; "n" = negative)
colnames(freq_coeff_mat_input) <- paste0(
  "F", rep(c(10, 20, 40, 60, 80, 100), each = 10),
  "_C", rep(c("0", "n.25", ".25", "n.5", ".5", "n.75", ".75", "n1", "1", "2"), times = 6)
)

Generate survival data

freq_coeff_mat <- sim.survdata(T=1825, N=1000, num.data.frames = 5, censor=0.5, knots=37, X=freq_coeff_mat_input, beta = rep(c(0,-0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2), 6))
79 observations have drawn durations at the minimum or maximum possible value. Generating coefficients and other quantities of interest are unlikely to be returned by models due to truncation. Consider making user-supplied coefficients smaller, making T bigger, or decreasing the variance of the X variables.

226 observations have drawn durations at the minimum or maximum possible value. Generating coefficients and other quantities of interest are unlikely to be returned by models due to truncation. Consider making user-supplied coefficients smaller, making T bigger, or decreasing the variance of the X variables.

Diagnostics for simulated data

survsim.plot(freq_coeff_mat, df=1)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(freq_coeff_mat, df=2)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(freq_coeff_mat, df=3)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(freq_coeff_mat, df=4)
TableGrob (2 x 1) "arrange": 2 grobs

survsim.plot(freq_coeff_mat, df=5)
TableGrob (2 x 1) "arrange": 2 grobs

Based on the warnings, samples 2, 3, and 4 seem to have the best results, but this dataset is exceedingly weird, probably because so much of it is forced into a particular pairing of coefficients with frequency. Even samples 2, 3, and 4 appear highly skewed toward the earlier timepoints. This is not necessarily bad, but we should note it for future investigation.


Results

Unrealistic genetic data randomForestSRC workflow using default params

Tuning data

tune_genepanel_1 <- tune(Surv(y, failed) ~ ., sim_genepanel[[1]]$data)
tune_genepanel_2 <- tune(Surv(y, failed) ~ ., sim_genepanel[[2]]$data)
tune_genepanel_3 <- tune(Surv(y, failed) ~ ., sim_genepanel[[3]]$data)
tune_genepanel_4 <- tune(Surv(y, failed) ~ ., sim_genepanel[[4]]$data)
tune_genepanel_5 <- tune(Surv(y, failed) ~ ., sim_genepanel[[5]]$data)
message("Model 1")
plot.tune(tune_genepanel_1)
tune_genepanel_1$optimal
message("Model 2")
plot.tune(tune_genepanel_2)
tune_genepanel_2$optimal
message("Model 3")
plot.tune(tune_genepanel_3)
tune_genepanel_3$optimal
message("Model 4")
plot.tune(tune_genepanel_4)
tune_genepanel_4$optimal
message("Model 5")
plot.tune(tune_genepanel_5)
tune_genepanel_5$optimal

Training data for 50/50 frequency data (unrealistic)

rfsrc_genepanel_1 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[1]]$data, ntree = 4000, nodesize = 35, mtry=201, importance = TRUE)
rfsrc_genepanel_2 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[2]]$data, ntree = 4000, nodesize = 25, mtry=161, importance = TRUE)
rfsrc_genepanel_3 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[3]]$data, ntree = 4000, nodesize = 8, mtry=104, importance = TRUE)
rfsrc_genepanel_4 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[4]]$data, ntree = 4000, nodesize = 65, mtry=488, importance = TRUE)
rfsrc_genepanel_5 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[5]]$data, ntree = 4000, nodesize = 60, mtry=251, importance = TRUE)
rfsrc_genepanel_1 
rfsrc_genepanel_2 
rfsrc_genepanel_3 
rfsrc_genepanel_4 
rfsrc_genepanel_5 

plot(rfsrc_genepanel_1)
plot(gg_rfsrc(rfsrc_genepanel_1))
plot(rfsrc_genepanel_2)
plot(gg_rfsrc(rfsrc_genepanel_2))
plot(rfsrc_genepanel_3)
plot(gg_rfsrc(rfsrc_genepanel_3))
plot(rfsrc_genepanel_4)
plot(gg_rfsrc(rfsrc_genepanel_4))
plot(rfsrc_genepanel_5)
plot(gg_rfsrc(rfsrc_genepanel_5))

These data seem to indicate that for the unrealistic case of balanced binary variables with Cox coefficients normally distributed and maxing out around +/- 0.3, most models will converge by 2000 trees, which will definitely save time simulating as we get deeper into this study. Interestingly, although the coefficients for the variables are normally distributed, variable importance is heavily weighted toward positive importance, and it is not normally distributed.

Now let’s try working with a more realistic example. We’ll use 500 genes with mutation prevalences ranging from 5% of patients to 40% of patients. This distribution is deliberately a little broad so that it covers most known gene mutation prevalences in cancer, except for cancers with single causative mutations.
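As a hedged sketch of what such a matrix might look like (the notebook’s own `binary_generator()` presumably handles this; the names `prevalence` and `real_panel` here are illustrative, not the actual pipeline), each gene can be drawn as a Bernoulli variable with its own prevalence:

```r
# Illustrative sketch only: simulate a 1000-patient x 500-gene binary
# mutation matrix with per-gene prevalence drawn uniformly from 5% to 40%.
set.seed(2020)
prevalence <- runif(500, min = 0.05, max = 0.40)
real_panel <- sapply(prevalence, function(p) rbinom(1000, size = 1, prob = p))
dim(real_panel)  # 1000 patients x 500 genes
```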

More realistic gene panel results

Tuning data

tune_realpanel_1 <- tune(Surv(y, failed) ~ ., sim_realpanel[[1]]$data)
tune_realpanel_2 <- tune(Surv(y, failed) ~ ., sim_realpanel[[2]]$data)
tune_realpanel_3 <- tune(Surv(y, failed) ~ ., sim_realpanel[[3]]$data)
tune_realpanel_4 <- tune(Surv(y, failed) ~ ., sim_realpanel[[4]]$data)
tune_realpanel_5 <- tune(Surv(y, failed) ~ ., sim_realpanel[[5]]$data)
message("Model 1")
plot.tune(tune_realpanel_1)
tune_realpanel_1$optimal
message("Model 2")
plot.tune(tune_realpanel_2)
tune_realpanel_2$optimal
message("Model 3")
plot.tune(tune_realpanel_3)
tune_realpanel_3$optimal
message("Model 4")
plot.tune(tune_realpanel_4)
tune_realpanel_4$optimal
message("Model 5")
plot.tune(tune_realpanel_5)
tune_realpanel_5$optimal

Training for the realistic prevalence (5% to 40%) data

rfsrc_realpanel_1 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[1]]$data, ntree = 2000, nodesize = 9, mtry=300, importance = TRUE)
rfsrc_realpanel_2 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[2]]$data, ntree = 2000, nodesize = 55, mtry=293, importance = TRUE)
rfsrc_realpanel_3 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[3]]$data, ntree = 2000, nodesize = 7, mtry=188, importance = TRUE)
rfsrc_realpanel_4 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[4]]$data, ntree = 2000, nodesize = 25, mtry=63, importance = TRUE)
rfsrc_realpanel_5 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[5]]$data, ntree = 2000, nodesize = 25, mtry=151, importance = TRUE)
rfsrc_realpanel_1 
rfsrc_realpanel_2 
rfsrc_realpanel_3 
rfsrc_realpanel_4 
rfsrc_realpanel_5 

plot(rfsrc_realpanel_1)
plot(gg_rfsrc(rfsrc_realpanel_1))
plot(rfsrc_realpanel_2)
plot(gg_rfsrc(rfsrc_realpanel_2))
plot(rfsrc_realpanel_3)
plot(gg_rfsrc(rfsrc_realpanel_3))
plot(rfsrc_realpanel_4)
plot(gg_rfsrc(rfsrc_realpanel_4))
plot(rfsrc_realpanel_5)
plot(gg_rfsrc(rfsrc_realpanel_5))

Interesting. So that still doesn’t work worth a hoot for binary data. Now we have two ways to go: we can try the same type of pipeline with smaller numbers of genes, or we can switch to variant allele frequencies (VAFs) and see how the continuous-data preferences of random forests translate to RSF. Let’s try using the same matrix but converting the “1” values to continuous values between 0.1 and 1.
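A minimal sketch of that binary-to-VAF conversion (using a small stand-in matrix called `binary_panel`; the real notebook applies the same idea to the realistic panel):

```r
# Illustrative only: replace each 1 in a binary matrix with a random
# VAF drawn uniformly from [0.1, 1], leaving the 0s untouched.
set.seed(2020)
binary_panel <- matrix(rbinom(20, size = 1, prob = 0.5), nrow = 5)
vaf_panel <- apply(binary_panel, 1:2,
                   function(x) if (x == 1) runif(1, 0.1, 1) else 0)
```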

Realistic gene sequencing panel data using VAFs, with the previous matrix converted to VAF

I am somewhat concerned here that the coefficients from the continuous variables are not high enough (mostly < abs(0.2)), but we will see if the models can derive anything useful from them. We may have to manually set the coefficients for these data. But first let’s do a baseline, then we can work on setting fixed coefficients and seeing if filtering low coefficients from Cox PH can help.

Tuning data

tune_VAFpanel_1 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[1]]$data)
tune_VAFpanel_2 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[2]]$data)
tune_VAFpanel_3 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[3]]$data)
tune_VAFpanel_4 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[4]]$data)
tune_VAFpanel_5 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[5]]$data)
message("Model 1")
plot.tune(tune_VAFpanel_1)
tune_VAFpanel_1$optimal
message("Model 2")
plot.tune(tune_VAFpanel_2)
tune_VAFpanel_2$optimal
message("Model 3")
plot.tune(tune_VAFpanel_3)
tune_VAFpanel_3$optimal
message("Model 4")
plot.tune(tune_VAFpanel_4)
tune_VAFpanel_4$optimal
message("Model 5")
plot.tune(tune_VAFpanel_5)
tune_VAFpanel_5$optimal

Training using VAF data derived from same binary data

rfsrc_VAFpanel_1 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[1]]$data, ntree = 2000, nodesize = 30, mtry=51, importance = TRUE)
rfsrc_VAFpanel_2 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[2]]$data, ntree = 2000, nodesize = 75, mtry=51, importance = TRUE)
rfsrc_VAFpanel_3 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[3]]$data, ntree = 2000, nodesize = 25, mtry=33, importance = TRUE)
rfsrc_VAFpanel_4 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[4]]$data, ntree = 2000, nodesize = 35, mtry=63, importance = TRUE)
rfsrc_VAFpanel_5 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[5]]$data, ntree = 2000, nodesize = 50, mtry=51, importance = TRUE)
rfsrc_VAFpanel_1 
rfsrc_VAFpanel_2 
rfsrc_VAFpanel_3 
rfsrc_VAFpanel_4 
rfsrc_VAFpanel_5 

plot(rfsrc_VAFpanel_1)
plot(gg_rfsrc(rfsrc_VAFpanel_1))
plot(rfsrc_VAFpanel_2)
plot(gg_rfsrc(rfsrc_VAFpanel_2))
plot(rfsrc_VAFpanel_3)
plot(gg_rfsrc(rfsrc_VAFpanel_3))
plot(rfsrc_VAFpanel_4)
plot(gg_rfsrc(rfsrc_VAFpanel_4))
plot(rfsrc_VAFpanel_5)
plot(gg_rfsrc(rfsrc_VAFpanel_5))

Realistic gene sequencing panel data using VAFs, selected down based on coefficient

The performance of the previous models was relatively abysmal (rarely above 45% accuracy despite Cox capturing nearly all of the variability). Let’s see if we can get better performance by forcibly reducing the data down to only the variables with the highest coefficients. Since most coefficients are below abs(1.5), let’s use that as the cutoff first.
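The reduction step can be sketched as follows (illustrative values only; `sim.survdata()` objects expose the true simulated coefficients in a `betas` element, but the coefficient values and variable names below are made up):

```r
# Illustrative sketch: keep only variables whose known simulated
# coefficient exceeds 1.5 in absolute value.
betas <- c(F1 = 0.3, F2 = -1.8, F3 = 2.1, F4 = -0.4)  # made-up coefficients
X <- data.frame(F1 = rnorm(5), F2 = rnorm(5), F3 = rnorm(5), F4 = rnorm(5))
keep <- names(betas)[abs(betas) > 1.5]
X_reduc <- X[, keep, drop = FALSE]  # F2 and F3 survive the cutoff
```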

Tuning data

tune_VAFpanel_reduc_1 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[1]]))
tune_VAFpanel_reduc_2 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[2]]))
tune_VAFpanel_reduc_3 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[3]]))
tune_VAFpanel_reduc_4 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[4]]))
tune_VAFpanel_reduc_5 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[5]]))
message("Model 1")
plot.tune(tune_VAFpanel_reduc_1)
tune_VAFpanel_reduc_1$optimal
message("Model 2")
plot.tune(tune_VAFpanel_reduc_2)
tune_VAFpanel_reduc_2$optimal
message("Model 3")
plot.tune(tune_VAFpanel_reduc_3)
tune_VAFpanel_reduc_3$optimal
message("Model 4")
plot.tune(tune_VAFpanel_reduc_4)
tune_VAFpanel_reduc_4$optimal
message("Model 5")
plot.tune(tune_VAFpanel_reduc_5)
tune_VAFpanel_reduc_5$optimal

Training using VAF data derived from same binary data

rfsrc_VAFpanel_reduc_1 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[1]], ntree = 2000, nodesize = 8, mtry=7, importance = TRUE)
rfsrc_VAFpanel_reduc_2 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[2]], ntree = 2000, nodesize = 50, mtry=8, importance = TRUE)
rfsrc_VAFpanel_reduc_3 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[3]], ntree = 2000, nodesize = 75, mtry=7, importance = TRUE)
rfsrc_VAFpanel_reduc_4 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[4]], ntree = 2000, nodesize = 35, mtry=8, importance = TRUE)
rfsrc_VAFpanel_reduc_5 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[5]], ntree = 2000, nodesize = 20, mtry=9, importance = TRUE)
rfsrc_VAFpanel_reduc_1 
rfsrc_VAFpanel_reduc_2 
rfsrc_VAFpanel_reduc_3 
rfsrc_VAFpanel_reduc_4 
rfsrc_VAFpanel_reduc_5 

plot(rfsrc_VAFpanel_reduc_1)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_1))
plot(rfsrc_VAFpanel_reduc_2)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_2))
plot(rfsrc_VAFpanel_reduc_3)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_3))
plot(rfsrc_VAFpanel_reduc_4)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_4))
plot(rfsrc_VAFpanel_reduc_5)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_5))

Interestingly, it seems like even if we select down to just the highest absolute values of coefficients (roughly 50 vars per model), it’s still not enough at these coefficient levels. We may need much higher HRs among variables to detect them successfully. This was a rather naive approach in retrospect. A better approach will probably be to find a positive control and then fiddle “upward” until I find the tolerances of the random survival forest.

Positive control models

We’ll try generating positive controls that mimic real-world genetic data using VAFs and see if we can actually decently model them with a minimum number of variables. One would expect that 20 random variables would be sufficient to generate good coefficients, but many of these coefficients seem relatively low. We’ll see what we can do with this, but I remain less than hopeful.

Tuning data

tune_posctrl_1 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[1]]$data)
tune_posctrl_2 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[2]]$data)
tune_posctrl_3 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[3]]$data)
tune_posctrl_4 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[4]]$data)
tune_posctrl_5 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[5]]$data)
message("Model 1")
plot.tune(tune_posctrl_1)
tune_posctrl_1$optimal
message("Model 2")
plot.tune(tune_posctrl_2)
tune_posctrl_2$optimal
message("Model 3")
plot.tune(tune_posctrl_3)
tune_posctrl_3$optimal
message("Model 4")
plot.tune(tune_posctrl_4)
tune_posctrl_4$optimal
message("Model 5")
plot.tune(tune_posctrl_5)
tune_posctrl_5$optimal

Training using VAF data derived from same binary data

rfsrc_posctrl_1 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[1]]$data, ntree = 2000, nodesize = 60, mtry=2, importance = TRUE)
rfsrc_posctrl_2 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[2]]$data, ntree = 2000, nodesize = 70, mtry=4, importance = TRUE)
rfsrc_posctrl_3 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[3]]$data, ntree = 2000, nodesize = 40, mtry=3, importance = TRUE)
rfsrc_posctrl_4 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[4]]$data, ntree = 2000, nodesize = 75, mtry=1, importance = TRUE)
rfsrc_posctrl_5 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[5]]$data, ntree = 2000, nodesize = 55, mtry=2, importance = TRUE)
rfsrc_posctrl_1 
                         Sample size: 1000
                    Number of deaths: 503
                     Number of trees: 2000
           Forest terminal node size: 60
       Average no. of terminal nodes: 20.2145
No. of variables tried at each split: 2
              Total no. of variables: 10
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 38.93%
rfsrc_posctrl_2 
                         Sample size: 1000
                    Number of deaths: 526
                     Number of trees: 2000
           Forest terminal node size: 70
       Average no. of terminal nodes: 18.068
No. of variables tried at each split: 4
              Total no. of variables: 10
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 41.07%
rfsrc_posctrl_3 
                         Sample size: 1000
                    Number of deaths: 486
                     Number of trees: 2000
           Forest terminal node size: 40
       Average no. of terminal nodes: 32.648
No. of variables tried at each split: 3
              Total no. of variables: 10
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 39.02%
rfsrc_posctrl_4 
                         Sample size: 1000
                    Number of deaths: 496
                     Number of trees: 2000
           Forest terminal node size: 75
       Average no. of terminal nodes: 14.5425
No. of variables tried at each split: 1
              Total no. of variables: 10
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 36.26%
rfsrc_posctrl_5 
                         Sample size: 1000
                    Number of deaths: 520
                     Number of trees: 2000
           Forest terminal node size: 55
       Average no. of terminal nodes: 23.134
No. of variables tried at each split: 2
              Total no. of variables: 10
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 37.05%
plot(rfsrc_posctrl_1)

plot(gg_rfsrc(rfsrc_posctrl_1))

plot(rfsrc_posctrl_2)

plot(gg_rfsrc(rfsrc_posctrl_2))

plot(rfsrc_posctrl_3)

plot(gg_rfsrc(rfsrc_posctrl_3))

plot(rfsrc_posctrl_4)

plot(gg_rfsrc(rfsrc_posctrl_4))

plot(rfsrc_posctrl_5)

plot(gg_rfsrc(rfsrc_posctrl_5))

Well, this appears promising, but it also appears that the plots for 1000 patients are a bit misleading because the resolution is simply too low. Given the accuracy of the model, the lower events may represent patients who experience relapse late in their disease course. In any case, let’s try it with something more closely approximating genetic data.

It seems the source of the problems may be the zero-inflation of genetic data. But we need a way to get around that if possible. The sparsity means rather low coefficients for survival in most cases.

If we reduce the coefficients to minimize the “blowout”, we also see a directly correlated loss in OOB performance. Interestingly, most of the RSF models seem to favor higher coefficients in importance, but there appears to be some interaction between prevalence and coefficient. Even though F10 has the highest coefficient in this model, it also has the lowest prevalence, and F6-9 have lower coefficients but higher prevalence and are often ranked higher in variable importance. However, this could also be the effect of the zero-inflation.

Freq x coeff models

Now we’ll train models on the frequency-by-coefficient matrix generated above: fixed coefficients from 0 to +/-2 crossed with mutation prevalences from 10% to 100%. This should let us map how random survival forest accuracy and variable importance respond to each frequency and coefficient combination.

Tuning data

tune_fvc_1 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[1]]$data)
tune_fvc_2 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[2]]$data)
tune_fvc_3 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[3]]$data)
tune_fvc_4 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[4]]$data)
tune_fvc_5 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[5]]$data)
message("Model 1")
Model 1
plot.tune(tune_fvc_1)

tune_fvc_1$optimal
nodesize     mtry 
      15       11 
message("Model 2")
Model 2
plot.tune(tune_fvc_2)

tune_fvc_2$optimal
nodesize     mtry 
      45       13 
message("Model 3")
Model 3
plot.tune(tune_fvc_3)

tune_fvc_3$optimal
nodesize     mtry 
      25       11 
message("Model 4")
Model 4
plot.tune(tune_fvc_4)

tune_fvc_4$optimal
nodesize     mtry 
      40       11 
message("Model 5")
Model 5
plot.tune(tune_fvc_5)

tune_fvc_5$optimal
nodesize     mtry 
      15       20 

Training using VAF data derived from same binary data

rfsrc_fvc_1 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[1]]$data, ntree = 2000, nodesize = 15, mtry=11, importance = TRUE)
rfsrc_fvc_2 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[2]]$data, ntree = 2000, nodesize = 45, mtry=13, importance = TRUE)
rfsrc_fvc_3 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[3]]$data, ntree = 2000, nodesize = 25, mtry=11, importance = TRUE)
rfsrc_fvc_4 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[4]]$data, ntree = 2000, nodesize = 40, mtry=11, importance = TRUE)
rfsrc_fvc_5 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[5]]$data, ntree = 2000, nodesize = 15, mtry=20, importance = TRUE)
rfsrc_fvc_1 
                         Sample size: 1000
                    Number of deaths: 513
                     Number of trees: 2000
           Forest terminal node size: 15
       Average no. of terminal nodes: 67.6305
No. of variables tried at each split: 11
              Total no. of variables: 60
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 26.03%
rfsrc_fvc_2 
                         Sample size: 1000
                    Number of deaths: 516
                     Number of trees: 2000
           Forest terminal node size: 45
       Average no. of terminal nodes: 21.2955
No. of variables tried at each split: 13
              Total no. of variables: 60
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 28.81%
rfsrc_fvc_3 
                         Sample size: 1000
                    Number of deaths: 506
                     Number of trees: 2000
           Forest terminal node size: 25
       Average no. of terminal nodes: 41.8105
No. of variables tried at each split: 11
              Total no. of variables: 60
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 27.23%
rfsrc_fvc_4 
                         Sample size: 1000
                    Number of deaths: 510
                     Number of trees: 2000
           Forest terminal node size: 40
       Average no. of terminal nodes: 23.3675
No. of variables tried at each split: 11
              Total no. of variables: 60
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 29.98%
rfsrc_fvc_5 
                         Sample size: 1000
                    Number of deaths: 470
                     Number of trees: 2000
           Forest terminal node size: 15
       Average no. of terminal nodes: 72.464
No. of variables tried at each split: 20
              Total no. of variables: 60
       Resampling used to grow trees: swor
    Resample size used to grow trees: 632
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          Error rate: 27.53%
plot(rfsrc_fvc_1)

plot(gg_rfsrc(rfsrc_fvc_1))

plot(rfsrc_fvc_2)

plot(gg_rfsrc(rfsrc_fvc_2))

plot(rfsrc_fvc_3)

plot(gg_rfsrc(rfsrc_fvc_3))

plot(rfsrc_fvc_4)

plot(gg_rfsrc(rfsrc_fvc_4))

plot(rfsrc_fvc_5)

plot(gg_rfsrc(rfsrc_fvc_5))

Variable importance calculations

test <- data.frame(varname=names(rfsrc_fvc_3[["importance"]]), importance = rfsrc_fvc_3[["importance"]])

test %>% ggplot() + geom_bar(aes(x=reorder(varname, -importance), y=importance), stat="identity") + theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) + labs(title="Variable Importance", x="Variables", y="Variable Importance - Permutation change in OOB")

Setup code to generate colorized heatmap of frequency versus coefficient

# Helper to pair each variable's importance with its known frequency and coefficient
make_fvc <- function(fit) {
  out <- data.frame(varname = names(fit[["importance"]]),
                    importance = fit[["importance"]])
  out$freq <- rep(c(10, 20, 40, 60, 80, 100), each = 10)
  out$coeff <- rep(c(0, -0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2), 6)
  out
}
fvc1 <- make_fvc(rfsrc_fvc_1)
fvc2 <- make_fvc(rfsrc_fvc_2)
fvc3 <- make_fvc(rfsrc_fvc_3)
fvc4 <- make_fvc(rfsrc_fvc_4)
fvc5 <- make_fvc(rfsrc_fvc_5)
ggplot(fvc1) + geom_contour_filled(aes(x=freq, y=coeff, z=importance)) + ggtitle("Simulation 1")

ggplot(fvc1) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 1")


ggplot(fvc2) + geom_contour_filled(aes(x=freq, y=coeff, z=importance))+ ggtitle("Simulation 2")

ggplot(fvc2) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 2")


ggplot(fvc3) + geom_contour_filled(aes(x=freq, y=coeff, z=importance)) + ggtitle("Simulation 3")

ggplot(fvc3) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 3")


ggplot(fvc4) + geom_contour_filled(aes(x=freq, y=coeff, z=importance)) + ggtitle("Simulation 4")

ggplot(fvc4) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 4")


ggplot(fvc5) + geom_contour_filled(aes(x=freq, y=coeff, z=importance)) + ggtitle("Simulation 5")

ggplot(fvc5) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 5")

Investigate criteria for variable selection

There are some outstanding questions about how frequency interacts with the Wald p-value and other calculations. Specifically, we would like to understand how frequency affects the coefficient and its p-value in univariate Cox maximum likelihood estimates.

We can better understand this by calculating a univariate Cox PH model per variable per simulated dataset and looking for any important trends.

This code is modified from http://www.sthda.com/english/wiki/cox-proportional-hazards-model#compute-the-cox-model

covariates <- colnames(freq_coeff_mat[[1]]$data[,1:60])
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste('Surv(y, failed)~', x)))
fc1_cox_models <- lapply( univ_formulas, function(x){coxph(x, data = freq_coeff_mat[[1]]$data)})
# Extract data 
fc1_cox_results <- lapply(fc1_cox_models,
                       function(x){ 
                          x <- summary(x)
                          p.value<-signif(x$wald["pvalue"], digits=2)
                          wald.test<-signif(x$wald["test"], digits=2)
                          beta <- signif(x$coef[1], digits=2)  # coefficient beta
                          HR <- signif(x$coef[2], digits=2)    # exp(beta)
                          HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
                          HR.confint.upper <- signif(x$conf.int[,"upper .95"],2)
                          HR <- paste0(HR, " (", 
                                       HR.confint.lower, "-", HR.confint.upper, ")")
                          res<-c(beta, HR, wald.test, p.value)
                          names(res)<-c("beta", "HR (95% CI for HR)", "wald.test", 
                                        "p.value")
                          return(res)
                          #return(exp(cbind(coef(x),confint(x))))
                         })
fc1_cox_results <- t(as.data.frame(fc1_cox_results, check.names = FALSE))
fc1_cox_results <- as.data.frame(fc1_cox_results)  # back to a data frame so new columns can be added
fc1_cox_results$freq <- rep(c(10, 20, 40, 60, 80, 100), each = 10)
fc1_cox_results$coeff <- rep(c(0, -0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2), 6)
ggplot(fc1_cox_results) + geom_text(aes(x=as.numeric(as.character(beta)), y=as.numeric(as.character(p.value)), label=rownames(fc1_cox_results)))


ggplot(fc1_cox_results) + geom_contour_filled(aes(x=freq, y=coeff, z=as.numeric(as.character(beta)))) + ggtitle("Simulation 1") + labs(fill="Cox Coeff")


ggplot(fc1_cox_results) + geom_tile(aes(x=freq, y=coeff, fill=as.numeric(as.character(beta)))) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Cox Coeff")

ggplot(fc1_cox_results) + geom_contour_filled(aes(x=freq, y=coeff, z=as.numeric(as.character(p.value)))) + ggtitle("Simulation 1") + labs(fill="Wald P-value")


ggplot(fc1_cox_results) + geom_tile(aes(x=freq, y=coeff, fill=as.numeric(as.character(p.value)))) + scale_fill_gradient(low="blue",high="yellow") + ggtitle("Wald P-value")

Conclusions

Random survival forests appear to show some variability in how effectively they select simulated “genomic-like” variables when those variables are simulated to represent VAF scores. For variables whose univariable Cox proportional hazards coefficients are below about 1 in absolute value, random survival forests are unable to effectively identify important variables.

Furthermore, zero inflation appears to matter for random survival forest performance. Genes with a simulated “mutational frequency” below 20% appear to be detected less reliably than more frequently mutated genes.

Both of these results are almost certainly related to the splitting process in individual trees of the random survival forest. Most likely, the effects of rare and/or low-coefficient genes are so small that most of the weak predictors in the random survival forest ensemble are unable to effectively split the data.

---
title: "Random Survival Forest Simulator Data Generator"
author: Zach Bohannan
date: Oct 10, 2020
output: html_notebook
---

# Background
## Libraries
```{r}
library(tidyverse)
library(coxed)
library(survival)
library(randomForestSRC)

library(ggRandomForests)

# Akima is simply a supplementary library to support the graphing of tuning data from randomForestSRC
library(akima)
```

## Introduction
### RSF and justification
The random survival forest (RSF) method was recently developed as a powerful machine learning tool for predicting survival, primarily in clinical populations. RSF extends traditional random forests (Breiman) to model cumulative hazard functions for individual patients using an ensemble of decision trees (Ishwaran), with each prediction being informed by the cumulative hazard function of a training population. These models represent a powerful alternative to traditional Cox proportional hazards models and time-to-event analyses.

RSFs have been implemented in several different contexts in both R and python. Both `RandomForestSRC` (https://cran.r-project.org/web/packages/randomForestSRC/citation.html) and `ranger` (https://cran.r-project.org/web/packages/ranger/citation.html) are common R-based implementations of RSF, and python-based RSF has recently been implemented in the `scikit-survival` package (https://scikit-survival.readthedocs.io/en/latest/cite.html). These implementations have been benchmarked on a variety of data as well as being independently supported by third-party research findings.

However, despite intense interest and use, few comprehensive benchmarking studies have been performed on these models. Most third-party benchmarking studies of RSF are focused on a few key aspects of the implementation or data types (for example, https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-017-0383-8). Furthermore, most benchmarking studies are based on a presumption of a particular type or quality level of real-world clinical data, which represents a relatively narrow use case and makes it difficult to fully understand the applicability of RSF methods to common clinical data scenarios such as unbalanced data or small sample sizes. These challenges are exacerbated by the wealth of tuning options for RSF implementations (`RandomForestSRC` calls have 34 listed arguments, and `ranger` calls have 38, although some of those arguments are unlikely to affect predictive performance). The combination of current knowledge gaps and a large number of prospective tuning options makes it nearly impossible to adequately optimize random survival forests for a given real-world use case without detailed background knowledge of model behavior. The goal of this study is to perform RSF benchmarking using simulated gene and survival data.

### Survival data simulation details
Harden (2018) is critical to this endeavor. He describes a relatively elegant solution to simulating survival data that builds the survival and hazard functions in a stepwise manner. First, it randomly selects timepoints along the desired duration, and then it randomly selects the same number of values in the [0,1] interval, with 0 and 1 serving as anchor points at time 0 and the max time. These Y values are sorted and then assigned to the chosen timepoints. A cubic spline is constructed through the points using Hyman's method (1983) to ensure monotonicity. This represents the cumulative density function (CDF) of the failure model. The probability density function (PDF) is constructed from the differences between adjacent points. The survival function can then be constructed as the complement of the CDF (S(t) = 1 - F(t)). Finally, the baseline hazard function is computed by dividing the failure PDF by the survival function (see Fig 1 in Harden).
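The steps above can be sketched in a few lines of base R (my own illustration of the mechanics, not `coxed` internals):

```{r}
set.seed(1)
T_max <- 100
k <- 8                                      # number of interior knots
t_knots <- sort(sample(1:(T_max - 1), k))   # random timepoints along the duration
y_knots <- sort(runif(k))                   # sorted values in [0, 1]
# Monotone cubic spline through (0,0), the knots, and (T_max,1): the failure CDF
cdf <- splinefun(c(0, t_knots, T_max), c(0, y_knots, 1), method = "hyman")
tt <- 0:T_max
F_t <- cdf(tt)                    # failure CDF, monotone by construction
f_t <- diff(F_t)                  # failure PDF from differences between adjacent points
S_t <- 1 - F_t                    # survival function as the complement of the CDF
h_t <- f_t / S_t[-length(S_t)]    # baseline hazard = failure PDF / survival
```

`method = "hyman"` is what guarantees the spline never dips, so the implied PDF and hazard stay non-negative.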

Covariates are randomly generated (or given by the analyst), and then "true" coefficients are randomly generated as well (these can also be set by the analyst, which will allow me to create uninformative variables). The combination of covariate and coefficient values represents a linear predictor. This linear predictor is used to calculate exp(XBeta), which represents the percent change from the baseline survival function for a given observation (e.g., 50% lower at all points). A value is then drawn from U[0,1], and the timepoint at which the calculated survival function drops below that value becomes the duration for that patient. Censoring is calculated independently based on the presumption that censorship mechanisms are not linked to the data-generating process (Jackson 2014), which is a reasonable assumption.
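For a single observation, the draw described above might look like the following hedged sketch. It uses the standard proportional-hazards shift S_i(t) = S_0(t)^exp(XBeta); `coxed`'s internal bookkeeping differs in detail, and the baseline curve and coefficient values here are toy assumptions:

```{r}
set.seed(2)
S_base <- exp(-0.02 * (0:365))            # toy baseline survival over one year
x_beta <- sum(c(1, 0.3) * c(0.5, -0.2))   # covariates times "true" coefficients
S_i <- S_base ^ exp(x_beta)               # subject-specific survival curve
u <- runif(1)                             # uniform draw on [0, 1]
duration <- which(S_i < u)[1] - 1         # first day S_i drops below u
```

An NA `duration` would correspond to a subject whose curve never drops below the draw within the observation window.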

## Plan
### Potential data features
#### Variable types
* Binary (e.g., mutation)
  + Roughly 1000 variables, maybe 500 (corresponds with Illumina TruSight 500 panel)
* Ordinal (e.g., severity or stage)
  + 10 (ordinal data are relatively uncommon)
* Unordered factor (e.g., race)
  + 10 (unordered factor data are also relatively uncommon)
* Continuous percentage (e.g., VAF)
  + Roughly 1000 variables, maybe 500
* Continuous larger numbers with log-norm distribution (e.g., TPM; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1240081/) 
  + Ask Brian C
* Continuous larger numbers with norm distribution (e.g., WBC)
  + 100 variables

#### Percentage of informative variables
* All (100%)
* High (75%)
* Moderate (50%)
* Low (25%)
* Very few (5%)


#### Data balance
* Balanced
* Minor imbalance pos (60/40)
* Minor imbalance neg (40/60)
* Major imbalance pos (75/25)
* Major imbalance neg (25/75)
* Rare events pos (95/5)
* Rare events neg (5/95)

#### Data availability
Sample sizes on a log scale, from large to small:
```{r}
exp(seq(log(50), log(1000), length.out = 10))
```
Seems reasonable to use:
50, 70, 97, 136, 189, 264, 368, 514, 717, 1000

### Simulation needs
2000 samples
* Cite Hendry (2014)?; https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.5945
* 1000 samples 0 event
* 1000 samples 1 event
* Can use (https://cran.r-project.org/web/packages/coxed/vignettes/simulating_survival_data.html) to simulate data

Select a random number of each type of sample to equal the percentages given in the above log list. This represents 7 possibilities across 10 data size categories or a matrix of 70 datasets. Then we should select the number of features to be reasonable for a given type of data. 
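The implied design grid can be enumerated directly (the balance labels are my own shorthand for the levels listed above):

```{r}
balances <- c("50/50", "60/40", "40/60", "75/25", "25/75", "95/5", "5/95")
sizes <- round(exp(seq(log(50), log(1000), length.out = 10)))
design <- expand.grid(balance = balances, n = sizes)
nrow(design)   # 70 dataset configurations
```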

Start the simulations with homogeneous data types representing "typical" variable numbers:
* Binary
* Continuous percentage
* Continuous log-norm

Each has the 5 levels of informative data:
* Randomly filtered
* Uninformative removal

With this list of combinations, I'm simulating 1050 different datasets. More precisely, I'm combining a smaller pool of simulated data into 1050 different combinations.
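One decomposition consistent with that 1050 figure (my reading of the lists above): 7 balance levels, 10 sample sizes, 3 homogeneous variable types, and 5 informativeness levels.

```{r}
7 * 10 * 3 * 5   # 1050 dataset combinations
```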


# Methods
## Generic plotting function for randomForestSRC
```{r}
# Plot a filled-contour surface of OOB error over the (nodesize, mtry) grid
# explored by tune(), marking the minimum with an "x"
plot.tune <- function(o, linear = TRUE) {
  x <- o$results[, 1]   # nodesize
  y <- o$results[, 2]   # mtry
  z <- o$results[, 3]   # OOB error
  so <- interp(x = x, y = y, z = z, linear = linear)
  idx <- which.min(z)
  x0 <- x[idx]
  y0 <- y[idx]
  filled.contour(x = so$x,
                 y = so$y,
                 z = so$z,
                 xlim = range(so$x, finite = TRUE) + c(-2, 2),
                 ylim = range(so$y, finite = TRUE) + c(-2, 2),
                 color.palette = colorRampPalette(c("blue", "orange")),
                 xlab = "nodesize",
                 ylab = "mtry",
                 main = "OOB error for nodesize and mtry in tune.rfsrc",
                 key.title = title(main = "OOB error", cex.main = 1),
                 plot.axes = {
                   axis(1)
                   axis(2)
                   points(x0, y0, pch = "x", cex = 1, font = 2)
                   points(x, y, pch = 16, cex = .25)
                 })
}
```
## Initial testing and scoping
To start the project and learn how to use `coxed`, let's focus on just using binary variables. In this case, the simulation will be:

Simulation Parameter | Value
-------------------- | -----
Sample number        | 2000
Case balance         | Balanced (50/50)
Duration             | 365 days
Variable type        | Normal
Variable number      | 500
Percentage inform    | Norm Dist Coeff

### coxed Tutorial
```{r}
tutorial <- sim.survdata(N=2000, T=365, num.data.frames = 1, censor=0.5, xvars = 500)
head(tutorial$data, 10)
summary(tutorial$data$failed)

ggplot(tutorial$data) + geom_histogram(aes(x=X1)) + ggtitle("Distribution of first sim var")
ggplot() + geom_histogram(aes(x=tutorial$betas)) + ggtitle("Distribution of 'true' coefficients of simulated data")
```

```{r}
tutorial_cox <- coxph(Surv(y, failed) ~ ., data=tutorial$data)
```

These results seem to meet most of our needs. Now we should try it with the types of data we want. We'll generate our own `X` (the covariate matrix) and allow the model to randomly assign coefficients. We can do this 5 times as a representative example and plot the various hazard functions.

## Genetic data generation
### Binary data matrix
2000 observations, each with 500 binary variables, representing genetic data taken from a large cancer screen (Illumina TruSight is roughly 500 genes)

Simulation Parameter | Value
-------------------- | -----
Sample number        | 2000
Case balance         | Balanced (50/50)
Duration             | 1825 days (5 years)
Variable type        | Binary
Variable number      | 500
Percentage inform    | Norm Dist (Random)

#### Variable generator function
```{r}
# This function generates a binary data matrix, as might be seen in mutation
# hotspot data or sequencing panels. var_freq is the integer percent positivity
# in the sample. It can be a single value, in which case every variable has the
# same prevalence, or a length-2 vector giving the range of possible frequencies.
# For frequency ranges, the columns are named by their generated frequency.
binary_generator <- function(samp_num, var_num, var_freq = 50) {
  if (length(var_freq) == 1) {
    message('binary_generator: Fixed frequency mode, fast generation. Pos freq = ', var_freq)
    data_mat <- as.data.frame(matrix(sample(0:1, samp_num * var_num, replace = TRUE,
                                            prob = c(1 - var_freq/100, var_freq/100)),
                                     samp_num, var_num))
  } else if (length(var_freq) == 2) {
    message('binary_generator: Variable frequency mode. Pos freq = ', var_freq[1], ":", var_freq[2])
    freqs <- sample(var_freq[1]:var_freq[2], var_num, replace = TRUE) / 100
    cols <- lapply(freqs, function(f) sample(0:1, samp_num, replace = TRUE, prob = c(1 - f, f)))
    names(cols) <- paste0("F", seq_along(freqs), "_", freqs)
    data_mat <- as.data.frame(cols)
  } else {
    stop("var_freq must have length 1 (fixed frequency) or 2 (frequency range)")
  }
  data_mat
}
```

### Generate initially scoped data matrix
```{r}
set.seed(12345)
gene_panel_input <- binary_generator(samp_num = 2000, var_num = 500)
head(gene_panel_input)
```
### Generate 5 simulated survival datasets
```{r}
sim_genepanel <- sim.survdata(T=1825, num.data.frames = 5, censor=0.5, X=gene_panel_input, knots=37)
```

The choice of `knots` matters here and definitely affects the distribution of simulated times. If the number of knots is too low, the simulated durations become very punctate, with many durations clustered around a few timepoints. For large time scales (e.g., simulated 5-year survival), T/50 seems to be a reasonable approximation leading to a good variety of data outputs.
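The T/50 rule of thumb above can be captured in a small helper (a convenience function of my own, not part of `coxed`):

```{r}
# Roughly one knot per 50 time units, with a floor of 2
knots_heuristic <- function(T, divisor = 50) max(2L, ceiling(T / divisor))
knots_heuristic(1825)   # 5-year simulation -> 37
knots_heuristic(365)    # 1-year simulation -> 8
```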

### Diagnostics for simulated data
```{r}
survsim.plot(sim_genepanel, df=1)
survsim.plot(sim_genepanel, df=2)
survsim.plot(sim_genepanel, df=3)
survsim.plot(sim_genepanel, df=4)
survsim.plot(sim_genepanel, df=5)
```

```{r}
coxph(Surv(y, failed) ~ ., data=sim_genepanel[[1]]$data)
head(sim_genepanel[[1]]$betas, 5)
```


## More realistic binary data matrix
This is very similar to before, but the variables have a more realistic frequency distribution (ranging from 5% to 40% of samples). We're also cutting down the sample size because the previous models were very large and took far too long to run.

Simulation Parameter | Value
-------------------- | -----
Sample number        | 1000
Case balance         | Balanced (50/50)
Duration             | 1825 days (5 years)
Variable type        | Binary
Variable number      | 300
Percentage inform    | Norm Dist (Random)

```{r}
set.seed(12345)
realistic_panel_input <- binary_generator(samp_num = 1000, var_num = 300, var_freq = c(5,40))
head(realistic_panel_input)
```

### Generate 5 simulated survival datasets
```{r}
sim_realpanel <- sim.survdata(T=1825, num.data.frames = 5, censor=0.5, X=realistic_panel_input, knots=37)
```

### Diagnostics for simulated data
```{r}
survsim.plot(sim_realpanel, df=1)
survsim.plot(sim_realpanel, df=2)
survsim.plot(sim_realpanel, df=3)
survsim.plot(sim_realpanel, df=4)
survsim.plot(sim_realpanel, df=5)
```

```{r}
coxph(Surv(y, failed) ~ ., data=sim_realpanel[[1]]$data)
head(sim_realpanel[[1]]$betas, 5)
```

```{r}
ggplot() + geom_histogram(aes(x=sim_realpanel[[1]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated real panel")
```

## VAF-style panel data
This is derived wholly from the "realistic" panel data, but it is updated to use VAFs instead of binary data. Each positive call is assigned a random VAF, with a cap of 1 (germline homozygous variant) and a floor of 0.05, which represents the minimum VAF that can be reliably detected by most variant callers.
```{r}
real_vaf_panel_input <- realistic_panel_input
# Multiply each positive call by a random VAF in [0.05, 1]
real_var_panel_input <- apply(real_vaf_panel_input, 1:2, function(x) x * (sample(5:100, 1)/100))
```

### Generate 5 simulated survival datasets
```{r}
sim_realVAF_panel <- sim.survdata(T=1825, num.data.frames = 5, censor=0.5, X=real_var_panel_input, knots=37)
```

### Diagnostics for simulated data
```{r}
survsim.plot(sim_realVAF_panel, df=1)
survsim.plot(sim_realVAF_panel, df=2)
survsim.plot(sim_realVAF_panel, df=3)
survsim.plot(sim_realVAF_panel, df=4)
survsim.plot(sim_realVAF_panel, df=5)
```

```{r}
coxph(Surv(y, failed) ~ ., data=sim_realVAF_panel[[1]]$data)
head(sim_realVAF_panel[[1]]$betas, 10)
```

```{r}
ggplot() + geom_histogram(aes(x=sim_realVAF_panel[[1]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated real panel with VAFs")
```
## Reduced VAF panel data based on selecting only variables with "high" coefficients
These coefficients are still quite low, so we may need another approach (create smaller lists of variables and then pad them with totally random variables).
```{r}
# For each simulated dataset, keep only the variables with |beta| > 0.15 and
# re-attach the outcome columns
sim_realVAF_panel_reduc <- lapply(sim_realVAF_panel, function(s) {
  reduc <- s$data[, which(abs(s$betas) > 0.15), drop = FALSE]
  reduc$y <- s$data$y
  reduc$failed <- s$data$failed
  reduc
})
```

## Positive control (small number of continuous variables)
Since these seem to behave poorly in the random survival forest, let's take a different approach and build a minimal positive control that we can then modify to test modeling tolerances.

### Generate binary dataset
1000 samples with 10 gene variables should lead to relatively high coefficients.
```{r}
set.seed(12345)
pos_ctrl_input <- binary_generator(samp_num = 1000, var_num = 10, var_freq = c(5,40))
pos_ctrl_input <- apply(pos_ctrl_input, 1:2, function(x) x*(sample(5:100,1)/100))
head(pos_ctrl_input)
```


### Generate 5 simulated survival datasets
```{r}
sim_pos_ctrl <- sim.survdata(T=1825, N=1000, num.data.frames = 5, censor=0.5, knots=37, X=pos_ctrl_input, beta = c(0,-1, 1, -2, 2, -3, 3, -4, 4, 5))
```

### Diagnostics for simulated data
```{r}
survsim.plot(sim_pos_ctrl, df=1)
survsim.plot(sim_pos_ctrl, df=2)
survsim.plot(sim_pos_ctrl, df=3)
survsim.plot(sim_pos_ctrl, df=4)
survsim.plot(sim_pos_ctrl, df=5)
```

```{r}
coxph(Surv(y, failed) ~ ., data=sim_pos_ctrl[[1]]$data)
head(sim_pos_ctrl[[1]]$betas, 10)
```

```{r}
ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[1]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")
ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[2]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")
ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[3]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")
ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[4]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")
ggplot() + geom_histogram(aes(x=sim_pos_ctrl[[5]]$betas)) + ggtitle("Distribution of 'true' coefficients of simulated VAF positive control")
```

## Positive control
Let's clean up this positive control a little bit. I think it makes sense to find the "minimum positive control" we can use. 10 variables seems a decent number, but let's see how big the coefficients need to be for it to behave well. Maybe we can also cut down on the number of aberrant samples at the top end (the warnings thrown by the function).

### Generate data
```{r}
sim_pos_ctrl <- sim.survdata(T=1825, N=1000, num.data.frames = 5, censor=0.5, knots=37, X=pos_ctrl_input, beta = c(0,-0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2))

```

### Diagnostics for simulated data
```{r}
survsim.plot(sim_pos_ctrl, df=1)
survsim.plot(sim_pos_ctrl, df=2)
survsim.plot(sim_pos_ctrl, df=3)
survsim.plot(sim_pos_ctrl, df=4)
survsim.plot(sim_pos_ctrl, df=5)
```

```{r}
coxph(Surv(y, failed) ~ ., data=sim_pos_ctrl[[1]]$data)
head(sim_pos_ctrl[[1]]$betas, 10)
```

We have an interesting conundrum here. If we generate coefficients less than 1, we can reduce the "blowout" that happens as the survival simulator tries to meet the coefficient requirements with zero-inflated data. However, as coefficients decrease below 1, we see a highly correlated performance loss in the RSF.

## Fixed positive control
The variable positive control seems to work okay, but we see an interaction in random survival forest accuracy between frequency and coefficient. What we need to do now is generate several datasets with fixed frequency and varying coefficients, essentially building a matrix of frequency and coefficient combinations whose performance we can track. Let's do 10, 20, 40, 60, 80, and 100 percent mutation prevalence.

### Generate binary dataset
Generate 1000 samples with 60 variables ranging in prevalence from 10% to 100%
```{r}
set.seed(12345)
# Generate a 10-variable block at each fixed prevalence, then convert positive
# calls to random VAFs (same per-cell conversion as before)
for (p in c(10, 20, 40, 60, 80, 100)) {
  block <- binary_generator(samp_num = 1000, var_num = 10, var_freq = p)
  block <- apply(block, 1:2, function(x) x * (sample(5:100, 1)/100))
  assign(paste0("prev_coeff_", p), block)
}
```

```{r}
freq_coeff_mat_input <- cbind(prev_coeff_10, prev_coeff_20, prev_coeff_40, prev_coeff_60, prev_coeff_80, prev_coeff_100)
freq_coeff_mat_input <- as.data.frame(freq_coeff_mat_input)
```

```{r}
# Name columns by frequency (F) and assigned coefficient (C; "n" = negative)
coeff_labels <- c("C0", "Cn.25", "C.25", "Cn.5", "C.5", "Cn.75", "C.75", "Cn1", "C1", "C2")
colnames(freq_coeff_mat_input) <- as.vector(outer(coeff_labels, c(10, 20, 40, 60, 80, 100),
                                                  function(cf, f) paste0("F", f, "_", cf)))
```

### Generate survival data
```{r}
freq_coeff_mat <- sim.survdata(T=1825, N=1000, num.data.frames = 5, censor=0.5, knots=37, X=freq_coeff_mat_input, beta = rep(c(0,-0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2), 6))

```

### Diagnostics for simulated data
```{r}
survsim.plot(freq_coeff_mat, df=1)
survsim.plot(freq_coeff_mat, df=2)
survsim.plot(freq_coeff_mat, df=3)
survsim.plot(freq_coeff_mat, df=4)
survsim.plot(freq_coeff_mat, df=5)
```

Based on the warnings, samples 2, 3, and 4 seem to have the best results, but this dataset is exceedingly weird, probably because so much of it is forced into a particular pairing of coefficients with frequencies. Even samples 2, 3, and 4 appear highly skewed toward the earlier timepoints. This is not necessarily bad, but we may need to note it for future investigation.

---------------------------------------------

# Results
## Unrealistic genetic data randomForestSRC workflow using default params
### Tuning data
```{r}
tune_genepanel_1 <- tune(Surv(y, failed) ~ ., sim_genepanel[[1]]$data)
tune_genepanel_2 <- tune(Surv(y, failed) ~ ., sim_genepanel[[2]]$data)
tune_genepanel_3 <- tune(Surv(y, failed) ~ ., sim_genepanel[[3]]$data)
tune_genepanel_4 <- tune(Surv(y, failed) ~ ., sim_genepanel[[4]]$data)
tune_genepanel_5 <- tune(Surv(y, failed) ~ ., sim_genepanel[[5]]$data)
```

```{r}
message("Model 1")
plot.tune(tune_genepanel_1)
tune_genepanel_1$optimal
message("Model 2")
plot.tune(tune_genepanel_2)
tune_genepanel_2$optimal
message("Model 3")
plot.tune(tune_genepanel_3)
tune_genepanel_3$optimal
message("Model 4")
plot.tune(tune_genepanel_4)
tune_genepanel_4$optimal
message("Model 5")
plot.tune(tune_genepanel_5)
tune_genepanel_5$optimal

```

### Training data for 50/50 frequency data (unrealistic)
```{r}
rfsrc_genepanel_1 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[1]]$data, ntree = 4000, nodesize = 35, mtry=201, importance = TRUE)
rfsrc_genepanel_2 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[2]]$data, ntree = 4000, nodesize = 25, mtry=161, importance = TRUE)
rfsrc_genepanel_3 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[3]]$data, ntree = 4000, nodesize = 8, mtry=104, importance = TRUE)
rfsrc_genepanel_4 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[4]]$data, ntree = 4000, nodesize = 65, mtry=488, importance = TRUE)
rfsrc_genepanel_5 <- rfsrc(Surv(y, failed) ~ ., sim_genepanel[[5]]$data, ntree = 4000, nodesize = 60, mtry=251, importance = TRUE)
```

```{r}
rfsrc_genepanel_1 
rfsrc_genepanel_2 
rfsrc_genepanel_3 
rfsrc_genepanel_4 
rfsrc_genepanel_5 

plot(rfsrc_genepanel_1)
plot(gg_rfsrc(rfsrc_genepanel_1))
plot(rfsrc_genepanel_2)
plot(gg_rfsrc(rfsrc_genepanel_2))
plot(rfsrc_genepanel_3)
plot(gg_rfsrc(rfsrc_genepanel_3))
plot(rfsrc_genepanel_4)
plot(gg_rfsrc(rfsrc_genepanel_4))
plot(rfsrc_genepanel_5)
plot(gg_rfsrc(rfsrc_genepanel_5))
```
These data seem to indicate that, for the unrealistic case of balanced binary variables with normally distributed Cox coefficients maxing out around +/- 0.3, most models converge by 2000 trees, which will definitely save simulation time as we get deeper into this study. Interestingly, although the coefficients for the variables are normally distributed, variable importance is heavily weighted toward positive values and is not normally distributed.

Now let's try working with a more realistic example. We'll use 300 genes with mutation prevalences ranging from 5% of patients to 40% of patients. This range is broad enough to cover most known gene mutation prevalences in cancer, except for cancers with single causative mutations.

## More realistic gene panel results
### Tuning data
```{r}
tune_realpanel_1 <- tune(Surv(y, failed) ~ ., sim_realpanel[[1]]$data)
tune_realpanel_2 <- tune(Surv(y, failed) ~ ., sim_realpanel[[2]]$data)
tune_realpanel_3 <- tune(Surv(y, failed) ~ ., sim_realpanel[[3]]$data)
tune_realpanel_4 <- tune(Surv(y, failed) ~ ., sim_realpanel[[4]]$data)
tune_realpanel_5 <- tune(Surv(y, failed) ~ ., sim_realpanel[[5]]$data)
```

```{r}
message("Model 1")
plot.tune(tune_realpanel_1)
tune_realpanel_1$optimal
message("Model 2")
plot.tune(tune_realpanel_2)
tune_realpanel_2$optimal
message("Model 3")
plot.tune(tune_realpanel_3)
tune_realpanel_3$optimal
message("Model 4")
plot.tune(tune_realpanel_4)
tune_realpanel_4$optimal
message("Model 5")
plot.tune(tune_realpanel_5)
tune_realpanel_5$optimal

```

### Training data for 50/50 frequency data (realistic)
```{r}
rfsrc_realpanel_1 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[1]]$data, ntree = 2000, nodesize = 9, mtry=300, importance = TRUE)
rfsrc_realpanel_2 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[2]]$data, ntree = 2000, nodesize = 55, mtry=293, importance = TRUE)
rfsrc_realpanel_3 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[3]]$data, ntree = 2000, nodesize = 7, mtry=188, importance = TRUE)
rfsrc_realpanel_4 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[4]]$data, ntree = 2000, nodesize = 25, mtry=63, importance = TRUE)
rfsrc_realpanel_5 <- rfsrc(Surv(y, failed) ~ ., sim_realpanel[[5]]$data, ntree = 2000, nodesize = 25, mtry=151, importance = TRUE)
```

```{r}
rfsrc_realpanel_1 
rfsrc_realpanel_2 
rfsrc_realpanel_3 
rfsrc_realpanel_4 
rfsrc_realpanel_5 

plot(rfsrc_realpanel_1)
plot(gg_rfsrc(rfsrc_realpanel_1))
plot(rfsrc_realpanel_2)
plot(gg_rfsrc(rfsrc_realpanel_2))
plot(rfsrc_realpanel_3)
plot(gg_rfsrc(rfsrc_realpanel_3))
plot(rfsrc_realpanel_4)
plot(gg_rfsrc(rfsrc_realpanel_4))
plot(rfsrc_realpanel_5)
plot(gg_rfsrc(rfsrc_realpanel_5))
```

Interesting. So that still doesn't work worth a hoot for binary data. Now we have two ways to go: we can try the same type of pipeline with smaller numbers of genes, or we can switch to VAFs and see how the continuous-data preferences of random forests translate to RSF. Let's try using the same matrix but converting the "1" values to continuous values between 0.05 and 1.

## Realistic gene sequencing panel data using VAFs, with the previous matrix converted to VAF
I am somewhat concerned here that the coefficients from the continuous variables are not high enough (mostly < abs(0.2)), but we will see if the models can derive anything useful from them. We may have to manually set the coefficients for these data. But first let's do a baseline, then we can work on setting fixed coefficients and seeing if filtering low coefficients from Cox PH can help.

### Tuning data
```{r}
tune_VAFpanel_1 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[1]]$data)
tune_VAFpanel_2 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[2]]$data)
tune_VAFpanel_3 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[3]]$data)
tune_VAFpanel_4 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[4]]$data)
tune_VAFpanel_5 <- tune(Surv(y, failed) ~ ., sim_realVAF_panel[[5]]$data)
```

```{r}
message("Model 1")
plot.tune(tune_VAFpanel_1)
tune_VAFpanel_1$optimal
message("Model 2")
plot.tune(tune_VAFpanel_2)
tune_VAFpanel_2$optimal
message("Model 3")
plot.tune(tune_VAFpanel_3)
tune_VAFpanel_3$optimal
message("Model 4")
plot.tune(tune_VAFpanel_4)
tune_VAFpanel_4$optimal
message("Model 5")
plot.tune(tune_VAFpanel_5)
tune_VAFpanel_5$optimal

```

### Training using VAF data derived from same binary data
```{r}
rfsrc_VAFpanel_1 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[1]]$data, ntree = 2000, nodesize = 30, mtry=51, importance = TRUE)
rfsrc_VAFpanel_2 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[2]]$data, ntree = 2000, nodesize = 75, mtry=51, importance = TRUE)
rfsrc_VAFpanel_3 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[3]]$data, ntree = 2000, nodesize = 25, mtry=33, importance = TRUE)
rfsrc_VAFpanel_4 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[4]]$data, ntree = 2000, nodesize = 35, mtry=63, importance = TRUE)
rfsrc_VAFpanel_5 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel[[5]]$data, ntree = 2000, nodesize = 50, mtry=51, importance = TRUE)
```

```{r}
rfsrc_VAFpanel_1 
rfsrc_VAFpanel_2 
rfsrc_VAFpanel_3 
rfsrc_VAFpanel_4 
rfsrc_VAFpanel_5 

plot(rfsrc_VAFpanel_1)
plot(gg_rfsrc(rfsrc_VAFpanel_1))
plot(rfsrc_VAFpanel_2)
plot(gg_rfsrc(rfsrc_VAFpanel_2))
plot(rfsrc_VAFpanel_3)
plot(gg_rfsrc(rfsrc_VAFpanel_3))
plot(rfsrc_VAFpanel_4)
plot(gg_rfsrc(rfsrc_VAFpanel_4))
plot(rfsrc_VAFpanel_5)
plot(gg_rfsrc(rfsrc_VAFpanel_5))
```

## Realistic gene sequencing panel data using VAFs, selected down based on coefficient
The performance of the previous models was relatively abysmal (rarely above 0.45 accuracy despite Cox capturing nearly all of the variability). Let's see if we can get better performance by forcibly reducing the data down to only the highest coefficients. Since most coefficients are below abs(0.15), let's try that threshold first.

### Tuning data
```{r}
tune_VAFpanel_reduc_1 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[1]]))
tune_VAFpanel_reduc_2 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[2]]))
tune_VAFpanel_reduc_3 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[3]]))
tune_VAFpanel_reduc_4 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[4]]))
tune_VAFpanel_reduc_5 <- tune(Surv(y, failed) ~ ., as.data.frame(sim_realVAF_panel_reduc[[5]]))
```

```{r}
message("Model 1")
plot.tune(tune_VAFpanel_reduc_1)
tune_VAFpanel_reduc_1$optimal
message("Model 2")
plot.tune(tune_VAFpanel_reduc_2)
tune_VAFpanel_reduc_2$optimal
message("Model 3")
plot.tune(tune_VAFpanel_reduc_3)
tune_VAFpanel_reduc_3$optimal
message("Model 4")
plot.tune(tune_VAFpanel_reduc_4)
tune_VAFpanel_reduc_4$optimal
message("Model 5")
plot.tune(tune_VAFpanel_reduc_5)
tune_VAFpanel_reduc_5$optimal

```

### Training using VAF data derived from same binary data
```{r}
rfsrc_VAFpanel_reduc_1 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[1]], ntree = 2000, nodesize = 8, mtry=7, importance = TRUE)
rfsrc_VAFpanel_reduc_2 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[2]], ntree = 2000, nodesize = 50, mtry=8, importance = TRUE)
rfsrc_VAFpanel_reduc_3 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[3]], ntree = 2000, nodesize = 75, mtry=7, importance = TRUE)
rfsrc_VAFpanel_reduc_4 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[4]], ntree = 2000, nodesize = 35, mtry=8, importance = TRUE)
rfsrc_VAFpanel_reduc_5 <- rfsrc(Surv(y, failed) ~ ., sim_realVAF_panel_reduc[[5]], ntree = 2000, nodesize = 20, mtry=9, importance = TRUE)
```

```{r}
rfsrc_VAFpanel_reduc_1 
rfsrc_VAFpanel_reduc_2 
rfsrc_VAFpanel_reduc_3 
rfsrc_VAFpanel_reduc_4 
rfsrc_VAFpanel_reduc_5 

plot(rfsrc_VAFpanel_reduc_1)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_1))
plot(rfsrc_VAFpanel_reduc_2)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_2))
plot(rfsrc_VAFpanel_reduc_3)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_3))
plot(rfsrc_VAFpanel_reduc_4)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_4))
plot(rfsrc_VAFpanel_reduc_5)
plot(gg_rfsrc(rfsrc_VAFpanel_reduc_5))
```

Interestingly, even if we select down to just the highest absolute coefficients (roughly 50 variables per model), it's still not enough at these coefficient levels. We may need much higher HRs among the variables to detect them successfully. This was a rather naive approach in retrospect. A better approach will probably be to build a positive control and then fiddle "upward" until I find the tolerances of the random survival forest.
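For context on the coefficient scale discussed here, Cox coefficients map to hazard ratios via exp(beta), so the |beta| > 0.15 cutoff used earlier keeps variables whose hazard changes by only about 16%:

```{r}
betas <- c(-1, -0.5, -0.15, 0.15, 0.5, 1)
round(exp(betas), 2)   # hazard ratios: 0.37 0.61 0.86 1.16 1.65 2.72
```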

## Positive control models
We'll try generating positive controls that mimic real-world genetic data using VAFs and see if we can decently model them with a minimum number of variables. One would expect that 10 random variables would be sufficient to generate good coefficients, but many of these coefficients seem relatively low. We'll see what we can do with this, but I remain less than hopeful.

### Tuning data
```{r}
tune_posctrl_1 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[1]]$data)
tune_posctrl_2 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[2]]$data)
tune_posctrl_3 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[3]]$data)
tune_posctrl_4 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[4]]$data)
tune_posctrl_5 <- tune(Surv(y, failed) ~ ., sim_pos_ctrl[[5]]$data)
```

```{r}
message("Model 1")
plot.tune(tune_posctrl_1)
tune_posctrl_1$optimal
message("Model 2")
plot.tune(tune_posctrl_2)
tune_posctrl_2$optimal
message("Model 3")
plot.tune(tune_posctrl_3)
tune_posctrl_3$optimal
message("Model 4")
plot.tune(tune_posctrl_4)
tune_posctrl_4$optimal
message("Model 5")
plot.tune(tune_posctrl_5)
tune_posctrl_5$optimal

```

### Training using VAF data derived from same binary data
```{r}
rfsrc_posctrl_1 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[1]]$data, ntree = 2000, nodesize = 60, mtry=2, importance = TRUE)
rfsrc_posctrl_2 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[2]]$data, ntree = 2000, nodesize = 70, mtry=4, importance = TRUE)
rfsrc_posctrl_3 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[3]]$data, ntree = 2000, nodesize = 40, mtry=3, importance = TRUE)
rfsrc_posctrl_4 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[4]]$data, ntree = 2000, nodesize = 75, mtry=1, importance = TRUE)
rfsrc_posctrl_5 <- rfsrc(Surv(y, failed) ~ ., sim_pos_ctrl[[5]]$data, ntree = 2000, nodesize = 55, mtry=2, importance = TRUE)
```

```{r}
rfsrc_posctrl_1 
rfsrc_posctrl_2 
rfsrc_posctrl_3 
rfsrc_posctrl_4 
rfsrc_posctrl_5 

plot(rfsrc_posctrl_1)
plot(gg_rfsrc(rfsrc_posctrl_1))
plot(rfsrc_posctrl_2)
plot(gg_rfsrc(rfsrc_posctrl_2))
plot(rfsrc_posctrl_3)
plot(gg_rfsrc(rfsrc_posctrl_3))
plot(rfsrc_posctrl_4)
plot(gg_rfsrc(rfsrc_posctrl_4))
plot(rfsrc_posctrl_5)
plot(gg_rfsrc(rfsrc_posctrl_5))
```

Well, this appears promising, but it also shows that the plots for 1000 patients can be a bit misleading because the resolution is simply too low. Given the accuracy of the model, the later events likely represent patients who relapse late in their disease course. In any case, let's try it with something that more closely approximates genetic data.

It seems the source of the problems may be the zero-inflation of genetic data, and we need a way around that if possible. The sparsity translates into rather low effective coefficients for survival in most cases.

If we reduce the coefficients to minimize the "blowout", we also see a directly correlated loss in OOB performance. Interestingly, most of the RSF models rank higher-coefficient variables as more important, but there appears to be an interaction between prevalence and coefficient: even though F10 has the highest coefficient in this model, it also has the lowest prevalence, while F6-F9 have lower coefficients but higher prevalence and are often ranked higher in variable importance. However, this could also be an effect of the zero-inflation.
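The degree of zero inflation per variable can be checked directly. A quick dependency-free sketch (the outcome column names match this notebook's convention; everything else is illustrative):

```{r}
# Fraction of zero entries per covariate: high values flag the
# zero-inflated variables that the forest struggles to split on.
zero_frac <- function(dat, outcome_cols = c("y", "failed")) {
  X <- dat[, setdiff(colnames(dat), outcome_cols), drop = FALSE]
  sort(colMeans(X == 0), decreasing = TRUE)
}
```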

## Freq x coeff models
Here we cross mutation frequency with coefficient size: each simulated variable gets one combination of prevalence (10-100%) and true coefficient (0 through ±1, plus 2), so we can map out where the random survival forest starts to recover signal as a function of both.
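Judging from the annotation code further down in this section (six frequency levels crossed with ten coefficients), the design grid behind `freq_coeff_mat` presumably looks like the following; this is a dependency-free sketch of the assumed layout, not the generator itself.

```{r}
# Assumed design of the frequency x coefficient grid: 60 covariates,
# one per combination of mutation frequency (%) and true coefficient.
freqs  <- c(10, 20, 40, 60, 80, 100)
coeffs <- c(0, -0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2)
design <- expand.grid(coeff = coeffs, freq = freqs)  # coeff varies fastest
design$varname <- paste0("F", seq_len(nrow(design)))
```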


### Tuning data
```{r}
tune_fvc_1 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[1]]$data)
tune_fvc_2 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[2]]$data)
tune_fvc_3 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[3]]$data)
tune_fvc_4 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[4]]$data)
tune_fvc_5 <- tune(Surv(y, failed) ~ ., freq_coeff_mat[[5]]$data)
```

```{r}
message("Model 1")
plot.tune(tune_fvc_1)
tune_fvc_1$optimal
message("Model 2")
plot.tune(tune_fvc_2)
tune_fvc_2$optimal
message("Model 3")
plot.tune(tune_fvc_3)
tune_fvc_3$optimal
message("Model 4")
plot.tune(tune_fvc_4)
tune_fvc_4$optimal
message("Model 5")
plot.tune(tune_fvc_5)
tune_fvc_5$optimal
```

### Training using VAF data derived from same binary data
```{r}
rfsrc_fvc_1 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[1]]$data, ntree = 2000, nodesize = 15, mtry=11, importance = TRUE)
rfsrc_fvc_2 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[2]]$data, ntree = 2000, nodesize = 45, mtry=13, importance = TRUE)
rfsrc_fvc_3 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[3]]$data, ntree = 2000, nodesize = 25, mtry=11, importance = TRUE)
rfsrc_fvc_4 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[4]]$data, ntree = 2000, nodesize = 40, mtry=11, importance = TRUE)
rfsrc_fvc_5 <- rfsrc(Surv(y, failed) ~ ., freq_coeff_mat[[5]]$data, ntree = 2000, nodesize = 15, mtry=20, importance = TRUE)
```

```{r}
rfsrc_fvc_1 
rfsrc_fvc_2 
rfsrc_fvc_3 
rfsrc_fvc_4 
rfsrc_fvc_5 

plot(rfsrc_fvc_1)
plot(gg_rfsrc(rfsrc_fvc_1))
plot(rfsrc_fvc_2)
plot(gg_rfsrc(rfsrc_fvc_2))
plot(rfsrc_fvc_3)
plot(gg_rfsrc(rfsrc_fvc_3))
plot(rfsrc_fvc_4)
plot(gg_rfsrc(rfsrc_fvc_4))
plot(rfsrc_fvc_5)
plot(gg_rfsrc(rfsrc_fvc_5))
```
### Variable importance calculations
```{r}
var_imp_test <- data.frame(varname=names(rfsrc_fvc_3[["importance"]]), importance = rfsrc_fvc_3[["importance"]])

var_imp_test %>% ggplot() + geom_bar(aes(x=reorder(varname, -importance), y=importance), stat="identity") + theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)) + labs(title="Variable Importance", x="Variables", y="Variable Importance - Permutation change in OOB")
```

### Setup for colorized heatmaps of frequency versus coefficient
```{r}
# Annotate each model's importance scores with the simulated mutation
# frequency (%) and true coefficient of every variable; the layout is
# identical for all five simulations, so use one helper.
freq_labels  <- c(rep(10, 10), rep(20, 10), rep(40, 10),
                  rep(60, 10), rep(80, 10), rep(100, 10))
coeff_labels <- rep(c(0, -0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2), 6)

annotate_importance <- function(model) {
  data.frame(varname    = names(model[["importance"]]),
             importance = model[["importance"]],
             freq       = freq_labels,
             coeff      = coeff_labels)
}

fvc1 <- annotate_importance(rfsrc_fvc_1)
fvc2 <- annotate_importance(rfsrc_fvc_2)
fvc3 <- annotate_importance(rfsrc_fvc_3)
fvc4 <- annotate_importance(rfsrc_fvc_4)
fvc5 <- annotate_importance(rfsrc_fvc_5)
```

```{r}
ggplot(fvc1) + geom_contour_filled(aes(x=freq, y=coeff, z=importance)) + ggtitle("Simulation 1") + labs(fill="Variable importance")
ggplot(fvc1) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 1")

ggplot(fvc2) + geom_contour_filled(aes(x=freq, y=coeff, z=importance))+ ggtitle("Simulation 2") + labs(fill="Variable importance")
ggplot(fvc2) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 2")

ggplot(fvc3) + geom_contour_filled(aes(x=freq, y=coeff, z=importance)) + ggtitle("Simulation 3") + labs(fill="Variable importance")
ggplot(fvc3) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 3")

ggplot(fvc4) + geom_contour_filled(aes(x=freq, y=coeff, z=importance)) + ggtitle("Simulation 4") + labs(fill="Variable importance")
ggplot(fvc4) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 4")

ggplot(fvc5) + geom_contour_filled(aes(x=freq, y=coeff, z=importance)) + ggtitle("Simulation 5") + labs(fill="Variable importance")
ggplot(fvc5) + geom_tile(aes(x=freq, y=coeff, fill=importance)) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Simulation 5")
```

## Investigate criteria for variable selection
There are some outstanding questions about how frequency interacts with the Wald p-value and related statistics. Specifically, we would like to understand how frequency affects the coefficient and its p-value in univariate Cox maximum likelihood estimates.

We can better understand this by fitting a univariate Cox PH model per variable per simulation and looking for important trends.

This code is modified from http://www.sthda.com/english/wiki/cox-proportional-hazards-model#compute-the-cox-model

```{r}
covariates <- colnames(freq_coeff_mat[[1]]$data[,1:60])
univ_formulas <- sapply(covariates,
                        function(x) as.formula(paste('Surv(y, failed)~', x)))
```

```{r}
fc1_cox_models <- lapply( univ_formulas, function(x){coxph(x, data = freq_coeff_mat[[1]]$data)})
# Extract data 
fc1_cox_results <- lapply(fc1_cox_models,
                       function(x){ 
                          x <- summary(x)
                          p.value<-signif(x$wald["pvalue"], digits=2)
                          wald.test<-signif(x$wald["test"], digits=2)
                          beta <- signif(x$coef[1], digits = 2)  # coefficient beta
                          HR <- signif(x$coef[2], digits = 2)    # exp(beta)
                          HR.confint.lower <- signif(x$conf.int[,"lower .95"], 2)
                          HR.confint.upper <- signif(x$conf.int[,"upper .95"],2)
                          HR <- paste0(HR, " (", 
                                       HR.confint.lower, "-", HR.confint.upper, ")")
                          res<-c(beta, HR, wald.test, p.value)
                          names(res)<-c("beta", "HR (95% CI for HR)", "wald.test", 
                                        "p.value")
                          return(res)
                          #return(exp(cbind(coef(x),confint(x))))
                         })
fc1_cox_results <- t(as.data.frame(fc1_cox_results, check.names = FALSE))
fc1_cox_results <- as.data.frame(fc1_cox_results)
```
```{r}
fc1_cox_results$freq <- c(rep(10,10), rep(20,10), rep(40,10), rep(60,10), rep(80,10), rep(100,10))
fc1_cox_results$coeff <- rep(c(0,-0.25, 0.25, -0.5, 0.5, -0.75, 0.75, -1, 1, 2), 6)
```

```{r}
ggplot(fc1_cox_results) + geom_text(aes(x=as.numeric(as.character(beta)), y=as.numeric(as.character(p.value)), label=rownames(fc1_cox_results)))

ggplot(fc1_cox_results) + geom_contour_filled(aes(x=freq, y=coeff, z=as.numeric(as.character(beta)))) + ggtitle("Simulation 1") + labs(fill="Cox Coeff")

ggplot(fc1_cox_results) + geom_tile(aes(x=freq, y=coeff, fill=as.numeric(as.character(beta)))) + scale_fill_gradient(low="yellow",high="blue") + ggtitle("Cox Coeff")
```

```{r}
ggplot(fc1_cox_results) + geom_contour_filled(aes(x=freq, y=coeff, z=as.numeric(as.character(p.value)))) + ggtitle("Simulation 1") + labs(fill="Wald P-value")

ggplot(fc1_cox_results) + geom_tile(aes(x=freq, y=coeff, fill=as.numeric(as.character(p.value)))) + scale_fill_gradient(low="blue",high="yellow") + ggtitle("Wald P-value")
```

# Conclusions
Random survival forests show considerable variability in their ability to recover simulated "genomic-like" signal when variables are simulated as VAF scores. For variables whose univariable Cox hazard ratios are close to 1, random survival forests are unable to reliably identify them as important.

Furthermore, zero inflation appears to matter for random survival forest performance: genes with a simulated "mutational frequency" below 20% are detected less reliably than more frequently mutated genes.

Both of these results are almost certainly related to the splitting process in the individual trees of the random survival forest. Most likely, the effects of rare and/or low-coefficient genes are so small that most of the weak learners in the ensemble cannot split the data effectively on them.
